
Scoreboard with register renaming


infinis...@gmail.com

Apr 28, 2018, 3:54:17 AM
I was wondering if anybody/any company has tried implementing scoreboarding with register renaming? It seems like a possibility... why hasn't anyone tried? Are there drawbacks to such an approach? Does centralized control circuitry equal layouts with lower clocks? Something the amateur wouldn't see? I mean, is there any real advantage to having the reservation stations issue instructions to execution units instead of a centralized solution?

Quadibloc

Apr 28, 2018, 2:46:46 PM
I'm not sure I understand you. What you're describing sounds like what
everyone usually does when making an out-of-order computer, the
reservation stations approach being mostly unique to the 91, 95, and 195.

adaptive...@gmail.com

Apr 28, 2018, 2:53:29 PM
On Saturday, April 28, 2018 at 4:54:17 PM UTC+9, infinis...@gmail.com wrote:
> I was wondering if anybody/any company has tried to implementing scoreboarding with register renaming? It seems like a possibilty... why hasn't anyone tried? Are there drawbacks to such a approach? centralized control circuitry equals layouts with lower clocks??? Something the amateur wouldn't see? I mean is there any real advantage to having the reservation stations issue instructions to execution units instead of a centralized solution?

Scoreboarding with register renaming has been done; you can find examples by searching the history of microprocessors.

Best,
S.Takano

infinis...@gmail.com

Apr 28, 2018, 4:37:32 PM
I'm not really a CPU historian beyond Intel/AMD commercial CPUs and whatever I happen to come across when learning about designs. When I looked up dynamic scheduling/out-of-order execution I came across the CDC 6600 and the System/360 Model 91, the first being the first example of scoreboarding and the second the first example of 'Tomasulo'. When I google scoreboarding with register renaming I just get explanations of the CDC 6600 implementation. If you could point out a particular processor I could look up, it would be greatly appreciated. As for what most out-of-order CPUs use: from what I've seen of commercial out-of-order CPUs (x86, PPC, ARM), they are variations of Tomasulo.

infinis...@gmail.com

Apr 28, 2018, 4:44:51 PM
I googled 'history of the microprocessor', 'history of the CPU', and 'history of mainframe cpu'; all that came up were Intel/x86-centric histories. Do you have a better search term?

EricP

Apr 29, 2018, 12:18:57 PM
infinis...@gmail.com wrote:
> I was wondering if anybody/any company has tried to implementing scoreboarding with register renaming? It seems like a possibilty... why hasn't anyone tried? Are there drawbacks to such a approach? centralized control circuitry equals layouts with lower clocks??? Something the amateur wouldn't see? I mean is there any real advantage to having the reservation stations issue instructions to execution units instead of a centralized solution?

The word scoreboard is quite overloaded, so you need to be more
explicit in describing what microarchitecture you are thinking about.
When a processor uses a "scoreboard" you really have to look
at exactly how it works to see how it is the same or different.

A CDC 6600 style uArch managed by its scoreboard is
- in order issue, out-of-order execute, out-of-order complete
- with RAW hazards handled by stall in the register read stage
- with WAR hazards handled by stall in the write-back stage
- with WAW hazards handled by stall in the issue stage
- stall for available Function Unit in issue stage
- no forwarding
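
In code terms, the checks in that list amount to something like the following (a minimal sketch in C; the register counts, field names, and structure are invented for illustration and are not taken from the 6600):

/* Sketch of 6600-style scoreboard stall checks (hypothetical encoding).
 * issue stage : stall on WAW (dest already pending) or busy function unit
 * read stage  : stall on RAW (a source is still being produced)
 * write-back  : stall on WAR (older instructions still have to read dest)
 */
#include <stdbool.h>

enum { NREGS = 8, NFU = 4 };

struct scoreboard {
    bool reg_pending_write[NREGS]; /* some FU will write this register      */
    int  reg_pending_reads[NREGS]; /* issued instructions yet to read it    */
    bool fu_busy[NFU];
};

/* in-order issue: stall on a WAW hazard or a busy function unit */
bool can_issue(const struct scoreboard *sb, int fu, int dest)
{
    return !sb->fu_busy[fu] && !sb->reg_pending_write[dest];
}

/* register-read stage: stall until both sources are no longer pending (RAW) */
bool can_read_operands(const struct scoreboard *sb, int src1, int src2)
{
    return !sb->reg_pending_write[src1] && !sb->reg_pending_write[src2];
}

/* write-back stage: stall while earlier instructions still need the old value (WAR) */
bool can_write_back(const struct scoreboard *sb, int dest)
{
    return sb->reg_pending_reads[dest] == 0;
}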

One important consequence of OoO complete is imprecise exceptions.
e.g.

div r0, r1, r2 ; r0 = r1/r2
add r3, r4, r5 ; r3 = r4+r5

The above add can issue, complete, and update r3;
then later the divide unit notices an exception, say underflow,
and throws it, aborting the update to r0.
So r0 is unchanged but the architecturally later r3 has been changed.

Adding a renamer in the instruction decode stage:
- eliminates WAR and WAR hazard detection and it becomes
subsumed by a more general resource availability check
for a free physical register.
Note that physical register status now tracks pending reads,
so a physical register is free if it is:
not the architecture register, not busy, and no reads pending.
- allows rollback to prior committed architected state
implementing precise exceptions.

It also needs some logic for rollback to a prior
architected register set, and to track and commit.

In the end we have:
- in order issue, out-of-order execute, out-of-order complete
- with RAW hazards handled by stall in the register read stage
- stall for free physical register in the instruction decode stage
- stall for available FU in issue stage
- no forwarding
- precise interrupts

Now the question becomes, once you have done all this,
how did the performance change?
Is having precise interrupts sufficient cause to
justify adding renaming?

Reservation stations (with a renamer) allow distribution of the
pending instructions from the decoder stage to the various stations
so you don't stall at the issue stage for an available FU.
Instead each RS tracks its own FU and data availability,
and does its own OoO scheduling and OoO complete.
But now you have to figure out how to get all the necessary
information and data to each RS.
One question, for example, is how many registers should each RS have?
Too few and you stall at decode, too many and they sit idle.

Eric

infinis...@gmail.com

Apr 29, 2018, 3:00:09 PM
On Sunday, April 29, 2018 at 12:18:57 PM UTC-4, EricP wrote:
>
> The word scoreboard is quite overloaded, so you need to be more
> explicit in describing what microarchitecture you are thinking about.
> When a processor uses a "scoreboard" you really have to look
> at exactly how it works to see how it is the same or different.

I was using the term scoreboarding both loosely and specifically. Why both? Because I was contemplating centralized control circuitry vs. the distributed method found in Tomasulo. So there is the possibility of scoreboarding with register renaming, and there are other possibilities as well.


> Adding a renamer in the instruction decode stage:
> - eliminates WAR and WAR hazard detection and it becomes
> subsumed by a more general resource availability check
> for a free physical register.
> Note that physical register status now tracks pending reads,
> so a physical register is free if it is:
> not the architecture register, not busy, and no reads pending.
> - allows rollback to prior committed architected state
> implementing precise exceptions.

First, you wrote WAR and WAR; I assume you meant WAR and WAW. Second, I thought the renamer would be relevant in the issue and read-operands stages, since there is no instruction decode stage in a scoreboard.

> It the end we have:
> - in order issue, out-of-order execute, out-of-order complete
> - with RAW hazards handled by stall in the register read stage
> - stall for free physical register in the instruction decode stage
> - stall for available FU in issue stage
> - no forwarding
> - precise interrupts

I thought reorder buffers allowed for precise interrupts. Register renaming alone does not. In fact, you state out-of-order completion, which implies imprecise interrupts.

> Now the question becomes, once you have done all this,
> how did the performance change?
> Is having precise interrupts sufficient cause to
> justify adding renaming?

Again, you're not adding precise interrupts with register renaming, but you are getting rid of the stalls caused by RAW and WAW hazards, which are the primary reasons Tomasulo is abstractly better than scoreboarding.

>
> Reservation stations (with renamer) allows distribution of the
> pending instructions from the decoder stage to the various stations
> so you don't stall at the issue stage for an available FU.
> Instead each RS tracks its own FU and data availability,
> and does its own OoO scheduling and OoO complete.
> But now you have to figure out how to get all the necessary
> information and data to each RS.
> One question, for example, is how many registers should each RS have?
> Too few and you stall at decode, too many and they sit idle.
>
> Eric
I understand the purpose of reservation stations, but I wonder whether this physically distributed design has a positive impact on clock speed vs. a centralized solution. Also, the only reason the scoreboard in the 6600 stalled if an FU was busy was that the FUs weren't pipelined. That should be easily remedied; in fact, IIRC, the 6600's successor, the 7600, implemented pipelined FUs.

EricP

Apr 29, 2018, 4:50:41 PM
infinis...@gmail.com wrote:
> On Sunday, April 29, 2018 at 12:18:57 PM UTC-4, EricP wrote:
>> The word scoreboard is quite overloaded, so you need to be more
>> explicit in describing what microarchitecture you are thinking about.
>> When a processor uses a "scoreboard" you really have to look
>> at exactly how it works to see how it is the same or different.
>
> I was using the term scoreboarding both loosely and specifically. How come both... I was contemplating centralized control circuitry vs the distributed method found in Tomasulo. So there is both the possibility of scoreboarding with register renaming, and there are other possibilities.
>
>
>> Adding a renamer in the instruction decode stage:
>> - eliminates WAR and WAR hazard detection and it becomes
>> subsumed by a more general resource availability check
>> for a free physical register.
>> Note that physical register status now tracks pending reads,
>> so a physical register is free if it is:
>> not the architecture register, not busy, and no reads pending.
>> - allows rollback to prior committed architected state
>> implementing precise exceptions.
>
> First you wrote WAR and WAR, I assume you meant WAR and WAW. Second I thought the renamer would be relevant in the issue and read operands stage since there is no instruction decode stage in a scoreboard.

Yes, that was a typo - should have been WAR and WAW.

Sure there is an instruction decoder for a scoreboarded CPU.
The scoreboard is a kind of scheduler: it tracks dependencies
and decides when to issue to the FUs.

The renamer removes architectural register dependencies
but it also separates the execute completion from state commit.

The renamer contains two sets of maps from architecture to
physical registers, the future set and the committed set.
At state commit, the committed map gets updated with the
latest physical register. If something goes wrong, the committed
set can be copied into the future set effecting a rollback.
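
A minimal sketch of that two-map arrangement in C (the type names and sizes are mine, purely for illustration):

/* Dual register-alias-table renamer: the future map (FRAT) is updated at
 * rename time, the committed map (CRAT) at commit time, and a rollback is
 * just a copy of CRAT back into FRAT. Sizes are arbitrary. */
#include <string.h>

enum { NARCH = 16, NPHYS = 64 };

struct renamer {
    int frat[NARCH];   /* architectural -> physical, speculative/future */
    int crat[NARCH];   /* architectural -> physical, committed          */
};

/* decode/rename: point the future map at a newly allocated physical reg */
void rename_dest(struct renamer *r, int arch, int new_phys)
{
    r->frat[arch] = new_phys;
}

/* commit: the instruction's mapping becomes architectural state */
void commit_dest(struct renamer *r, int arch, int phys)
{
    r->crat[arch] = phys;
}

/* exception/mispredict: restore the future map from committed state */
void rollback(struct renamer *r)
{
    memcpy(r->frat, r->crat, sizeof r->frat);
}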

>> It the end we have:
>> - in order issue, out-of-order execute, out-of-order complete
>> - with RAW hazards handled by stall in the register read stage
>> - stall for free physical register in the instruction decode stage
>> - stall for available FU in issue stage
>> - no forwarding
>> - precise interrupts
>
> I thought reorder buffers allowed for precise interrupts. Register renaming alone does not. In fact you state out of order completion which implies imprecise interrupts.

To me it is the renamer that records the committed vs future state,
and that is what creates precise interrupts.

Or to put it a different way, one could build a cpu with
a ROB and OoO scheduler and completion but _without_ a renamer.
It would NOT have precise interrupts.

I am trying to look at each potential change in isolation.
What is required to add JUST a renamer?
What is required for JUST OoO issuing and completion?
What is required for JUST forwarding?

A ROB is typically also associated with Out-of-Order issuing
and has other features associated with it,
and I was trying to separate out the changes for just adding
a renamer while retaining the original in-order issue.

But yes, adding the renamer would need something to trigger
the renamer state commit. I wanted something very simple.
I thought about various shift registers but none really worked out.
The simplest I can think of is a circular buffer with just
an architecture register#, a physical register#, and a Done flag.
It didn't seem fair to call that a ROB though that is partly its job.
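Something along these lines, perhaps (a sketch; the field names and the in-order retire loop are just my reading of the description above):

/* Circular commit buffer: entries are allocated in program order at rename
 * time, marked Done out of order at completion, and retired in order from
 * the head; retiring is what updates the committed map. Empty/full handling
 * is elided. */
#include <stdbool.h>

enum { NSLOTS = 16 };

struct commit_entry {
    int  arch_reg;   /* architectural register number        */
    int  phys_reg;   /* physical register it was renamed to  */
    bool done;       /* result has been produced             */
};

struct commit_buf {
    struct commit_entry slot[NSLOTS];
    int head, tail;  /* retire from head, allocate at tail   */
};

/* retire as many in-order completed entries as possible */
void retire(struct commit_buf *cb, int crat[])
{
    while (cb->head != cb->tail && cb->slot[cb->head].done) {
        struct commit_entry *e = &cb->slot[cb->head];
        crat[e->arch_reg] = e->phys_reg;   /* commit: update the committed map */
        cb->head = (cb->head + 1) % NSLOTS;
    }
}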

>> Now the question becomes, once you have done all this,
>> how did the performance change?
>> Is having precise interrupts sufficient cause to
>> justify adding renaming?
>
> Again you're not adding precise interrupts with register renaming but you are getting rid of the stalls caused by RAW and WAW hazards. Which are the primary reasons Tomasulo is abstractly better than Scoreboarding.

Yes it does make interrupts precise if you add some minimal
support logic, like that circular buffer I mention above.
It is the renamer that allows the state commit vs rollback.

Tomasulo bypasses the in-order issue bottleneck with reservation stations.
It ALSO has a renamer and therefore a future vs committed state.

>> Reservation stations (with renamer) allows distribution of the
>> pending instructions from the decoder stage to the various stations
>> so you don't stall at the issue stage for an available FU.
>> Instead each RS tracks its own FU and data availability,
>> and does its own OoO scheduling and OoO complete.
>> But now you have to figure out how to get all the necessary
>> information and data to each RS.
>> One question, for example, is how many registers should each RS have?
>> Too few and you stall at decode, too many and they sit idle.
>>
>> Eric
> I understand the purpose of reservation stations but I wonder if this distributed design (physically) has a positive impact on clock speed vs a centralized solution. Also the only reason the scoreboard in the 6600 stalled if a FU was busy was that the FU's weren't pipelined. That should be easily remedied. In fact IIRC the successor to the 6600 the 7600(???) implemented pipelined FU's.

From a concurrency tracking point of view you can have
multiple FU's, or pipelined FU's, or multiple pipelined FU's.
Either way there are multiple instructions in flight.
But a pipeline can only start or complete one instruction per clock,
while multiple FU's start or complete multiple instructions per clock
and so can have more port and bus resource contentions.

Eric




infinis...@gmail.com

Apr 29, 2018, 6:16:00 PM
On Sunday, April 29, 2018 at 4:50:41 PM UTC-4, EricP wrote:

> > First you wrote WAR and WAR, I assume you meant WAR and WAW. Second I thought the renamer would be relevant in the issue and read operands stage since there is no instruction decode stage in a scoreboard.
>
> Yes, that was a typo - should have been WAR and WAW.
>
> Sure there is an instruction decoder for a scoreboarded cpu.
> The scoreboard is a a kind of scheduler, it tracks dependencies
> and decides when to issue to the FU's.

Of course there is an instruction decoder, but it is part of the issue stage (at least in the 6600).

>
> The renamer removes architectural register dependencies
> but it also separates the execute completion from state commit.

With renaming but no ROB, history buffer, or other method to deal with per-in-flight-instruction state (I'm assuming no speculative hardware; I don't remember exactly how rollback of speculative execution occurs, but IIRC it's usually done with a ROB), the processor would only have access to the most recent state being considered by the hardware.

>
> The renamer contains two sets of maps from architecture to
> physical registers, the future set and the committed set.
> At state commit, the committed map gets updated with the
> latest physical register. If something goes wrong, the committed
> set can be copied into the future set effecting a rollback.

You would need multiple mappings just for the future set, since you would need processor context per in-flight instruction. And once committed you can't roll back; that's the definition of committed, i.e. you're sure this state is coherent (at least in the case of in-order retirement, which is necessary for precise interrupts).

>
> >> It the end we have:
> >> - in order issue, out-of-order execute, out-of-order complete
> >> - with RAW hazards handled by stall in the register read stage
> >> - stall for free physical register in the instruction decode stage
> >> - stall for available FU in issue stage
> >> - no forwarding
> >> - precise interrupts
> >
> > I thought reorder buffers allowed for precise interrupts. Register renaming alone does not. In fact you state out of order completion which implies imprecise interrupts.
>
> To me it is the renamer that records the committed vs future state,
> and that is what creates precise interrupts.
>
> Or to put it a different way, one could build a cpu with
> a ROB and OoO scheduler and completion but _without_ a renamer.
> It would NOT have precise interrupts.

Could you build an OoO CPU with a ROB but without renaming? That doesn't seem feasible. Also, since when does a renamer keep track of per-instruction state?

>
> I am trying to look at each potential change in isolation.
> What is required to add JUST a renamer?
> What is required for JUST OoO issuing and completion?
> What is required for JUST forwarding?
>
> A ROB is typically also associated with Out-of-Order issuing
> and has other features associated with it,
> and I was trying to separate out the changes for just adding
> a renamer while retaining the original in-order issue.

A ROB is associated with retirement, not issuing, and moreover it enables in-order commit. In using a ROB you commit state changes in order; in doing so you don't need to keep a full context per instruction, you just need to keep track of what each instruction does to the state.

>
> But yes, adding the renamer would need something to trigger
> the renamer state commit. I wanted something very simple.
> I thought about various shift registers but none really worked out.
> The simplest I can think of is a circular buffer with just
> an architecture register#, a physical register#, and a Done flag.
> It didn't seem fair to call that a ROB though that is partly its job.

I haven't delved into ROBs in a while, but IIRC your 'circular buffer' sounds like it's supposed to do what a ROB does. But you never mentioned a circular buffer in your original post; you only mentioned the renamer.

>
> >> Now the question becomes, once you have done all this,
> >> how did the performance change?
> >> Is having precise interrupts sufficient cause to
> >> justify adding renaming?
> >
> > Again you're not adding precise interrupts with register renaming but you are getting rid of the stalls caused by RAW and WAW hazards. Which are the primary reasons Tomasulo is abstractly better than Scoreboarding.
>
> Yes it does make interrupts precise if you add some minimal
> support logic, like that circular buffer I mention above.
> It is the renamer that allows the state commit vs rollback.

Like I said above, you didn't mention a circular buffer in your original post, and now you're saying the circular buffer is responsible, not the renamer.

>
> Tomasulo bypasses the in-order issue bottleneck with reservation stations.
> It ALSO has a renamer and therefore a future vs committed state.

Tomasulo issues in order and has imprecise interrupts. It's only when you add additional hardware that you get precise interrupts.



MitchAlsup

Apr 29, 2018, 10:44:41 PM
On Saturday, April 28, 2018 at 2:54:17 AM UTC-5, infinis...@gmail.com wrote:
> I was wondering if anybody/any company has tried to implementing scoreboarding with register renaming?

The CDC 6600 scoreboard renamed registers into the function unit delivering the RAW result

The CDC 7600 scoreboard renamed registers into the function unit and pipeline number of the RAW result

Neither did register renaming.

> It seems like a possibilty... why hasn't anyone tried? Are there drawbacks to such a approach? centralized control circuitry equals layouts with lower clocks??? Something the amateur wouldn't see? I mean is there any real advantage to having the reservation stations issue instructions to execution units instead of a centralized solution?

I have a 140 page chapter in a comp-arch book discussing scoreboards. Find a way to e-mail me and I will send it to you.

It discusses adding forwarding to a scoreboard, having multiple writes to the same register, and advancing the state of the scoreboard art.

BTW: the ENTIRE CDC 6600 scoreboard was smaller than a single entry in the 360-91 reservation station scheme.

Quadibloc

Apr 29, 2018, 11:44:35 PM
I may be all wrong, but to me he sounds like he is describing exactly
what everyone else is already doing: instead of reservation stations, they
use register renaming with centralized control, which is sort of
like a "scoreboard" - just not
exactly like the 6600, because now register renaming is added.

infinis...@gmail.com

Apr 30, 2018, 2:38:54 AM
On Sunday, April 29, 2018 at 10:44:41 PM UTC-4, MitchAlsup wrote:
> On Saturday, April 28, 2018 at 2:54:17 AM UTC-5, infinis...@gmail.com wrote:
> > I was wondering if anybody/any company has tried to implementing scoreboarding with register renaming?
>
> The CDC 6600 scoreboard renamed registers into the function unit delivering the RAW result
>

I thought the 6600 scoreboard buffers/stalls in the read-operands stage if the
operands aren't available?

> The CDC 7600 scoreboard renamed registers into the function unit and pipeline number of the RAW result
>
> Neither did register renaming.
>
If neither did register renaming, then why did you just say they renamed registers?

> > It seems like a possibilty... why hasn't anyone tried? Are there drawbacks to such a approach? centralized control circuitry equals layouts with lower clocks??? Something the amateur wouldn't see? I mean is there any real advantage to having the reservation stations issue instructions to execution units instead of a centralized solution?
>
> I have a 140 page chapter in a comp-arch book discussing scoreboards. Find a way to e-mail me and I will send it to you.
>
and how would I do that?

> It discusses adding forwarding to a scoreboard, having multiple writes to the same register, and advancing the state of the scoreboard art.
>
> BTW: the ENTIRE CDC 6600 scoreboard was smaller than a single entry in the 360-91 reservation station scheme.
This seems impossible.

infinis...@gmail.com

Apr 30, 2018, 3:14:55 AM
Who is doing this exactly?

Quadibloc

Apr 30, 2018, 3:39:32 AM
Pretty much everyone who is making an OoO CPU. But I could be wrong.

John Savard

Stephen Fuld

Apr 30, 2018, 11:05:23 AM
On 4/29/2018 7:44 PM, MitchAlsup wrote:


snip

> I have a 140 page chapter in a comp-arch book discussing scoreboards. Find a way to e-mail me and I will send it to you.
>
> It discusses adding forwarding to a scoreboard, having multiple writes to the same register, and advancing the state of the scoreboard art.


Is this a book that you are writing or an existing book by someone else?



--
- Stephen Fuld
(e-mail address disguised to prevent spam)

MitchAlsup

Apr 30, 2018, 10:29:04 PM
Apparently you aren't worth the trouble.

EricP

May 1, 2018, 12:36:29 PM
infinis...@gmail.com wrote:
> On Sunday, April 29, 2018 at 4:50:41 PM UTC-4, EricP wrote:
>
>> The renamer removes architectural register dependencies
>> but it also separates the execute completion from state commit.
>
> With renaming with no ROB or history buffer or ...alt method to deal with per inflight Instruction pointer state (i'm assuming no speculative hardware, I don't remember how rollback of spec ex occurs but IIRC its usually with a ROB) the processor would only have access to the most recent state being considered by the hardware.

Ok... we are talking about different kinds of renamers.

I'm thinking of a single unified physical register file for
both future and committed results, with renaming using two
Register Alias Tables (RAT) for future FRAT and committed CRAT.
On commit, the CRAT map is updated.

You are thinking of the ROB storing future result registers
that are later copied to the architecture registers on commit.
The RAT then selects between ROB or architecture register sources.

(A third way is using separate physical file and architecture file,
and copying from physical to architecture on commit.)

The Design Space of Register Renaming Techniques, Sima 2000
https://classes.soe.ucsc.edu/cmpe202/Fall04/papers/rat.pdf

My original reason for looking at this was to see what it would take
to allow variable latency instructions, with out-of-order completion,
with precise interrupts, at low cost.
Keeping in-order issue (maybe dual issue but still in-order)
with a unified physical file, a dual RAT renamer and a
circular commit order buffer looks like it would work.

Let's say there is a 4-bit architecture register number (ARN),
and 6-bit physical register number (PRN).

In the unified register approach a RAM-based RAT uses the 4-bit ARN
to select a map entry and read out a 6-bit PRN. There are 2 RAT tables,
a Future RAT (FRAT) and Committed RAT (CRAT).
There are also wires running horizontally from CRAT to FRAT allowing
all future entries to be reset to their committed values to effect
a rollback in one clock.

The single physical register file holds both future and committed values.
Source operands can come from the issuing instruction, or from
the physical file, and results all go to the physical file.

That just leaves a simple circular Commit Order Buffer (COM) to
trigger CRAT updates on commit (and committed program counter, etc).
The commit writes the committed ARN->PRN mapping into the CRAT, making
the future registers "real", which gives precise interrupts.

Eric

MitchAlsup

May 1, 2018, 12:47:00 PM
On Tuesday, May 1, 2018 at 11:36:29 AM UTC-5, EricP wrote:
> infinis...@gmail.com wrote:
> > On Sunday, April 29, 2018 at 4:50:41 PM UTC-4, EricP wrote:
> >
> >> The renamer removes architectural register dependencies
> >> but it also separates the execute completion from state commit.
> >
> > With renaming with no ROB or history buffer or ...alt method to deal with per inflight Instruction pointer state (i'm assuming no speculative hardware, I don't remember how rollback of spec ex occurs but IIRC its usually with a ROB) the processor would only have access to the most recent state being considered by the hardware.
>
> Ok... we are talking about different kinds of renamers.
>
> I'm thinking of a single unified physical register file for
> both future and committed results, with renaming using two
> Register Alias Tables (RAT) for future FRAT and committed CRAT.
> On commit the commit the CRAT map is updated.
>
> You are thinking of the ROB storing future result registers
> that are later copied to the architecture registers on commit.
> The RAT then selects between ROB or architecture register sources.
>
> (A third way is using separate physical file and architecture file,
> and copying from physical to architecture on commit.)

A fourth way is to use a physical register file using a CAM to map logical
registers to youngest physical register at issue and a decoder to write
into the physical registers. Each issue cycle the valid bits of the PRN
are written into a history buffer. Recovery is as simple as reading the
valid bits back out and writing them into the CAM.

infinis...@gmail.com

May 3, 2018, 5:29:18 PM
I was reviewing 'standard' scoreboarding and want to make sure I understand something correctly. The instruction window (max) is equal to the number of function units, correct? As in, only one instruction can be issued per function unit.

MitchAlsup

May 3, 2018, 5:52:28 PM
On Thursday, May 3, 2018 at 4:29:18 PM UTC-5, infinis...@gmail.com wrote:
> I was reviewing 'standard' scoreboarding and want to make sure I understand something correctly. The instruction window(max) is equal to the number of function units, correct? As in only one instruction can be issued per function unit.

In std scoreboard, yes.

However, it is easy to add function unit pipelining to the model so the
RAW hazards (forwarding) are denoted by [FU,p#]. This is basically what the CDC 7600 did. In this case the maximum number of instructions is FU*p#, and each FU can be issued only 1 instruction per clock, but multiple FUs can be issued per clock.
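
For illustration, the [FU,p#] idea boils down to something like this (a sketch with made-up widths and names):

/* Sketch of the [FU, pipe#] result-tag idea for a pipelined scoreboard:
 * a pending RAW source is named by the function unit producing it and the
 * slot in that unit's pipeline, so the instruction window is bounded by
 * NFU * DEPTH. Widths are illustrative only. */
#include <stdbool.h>

enum { NFU = 4, DEPTH = 5, MAX_IN_FLIGHT = NFU * DEPTH };

struct result_tag {
    unsigned fu   : 3;    /* which function unit produces the value       */
    unsigned slot : 3;    /* which stage of that FU's pipeline holds it   */
};

/* a waiting source operand becomes ready when the broadcast tag matches */
bool tag_matches(struct result_tag waiting, struct result_tag broadcast)
{
    return waiting.fu == broadcast.fu && waiting.slot == broadcast.slot;
}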

Also note: one can add forwarding by having the FU ship the "go_write" signal 1 cycle earlier than CDC 6600 did.

But one MAJOR trick (not often noticed by the amateur) is that the Scoreboard only requires latches in the pipeline (rather than requiring flip-flops). This gets rid of a LOT of area and power.

infinis...@gmail.com

May 6, 2018, 4:32:42 AM
In Tomasulo the instruction window is the size of the reorder buffer right?

MitchAlsup

May 6, 2018, 2:26:19 PM
On Sunday, May 6, 2018 at 3:32:42 AM UTC-5, infinis...@gmail.com wrote:
> In Tomasulo the instruction window is the size of the reorder buffer right?

Which is equal to the total number of stations.

infinis...@gmail.com

May 6, 2018, 9:42:42 PM
Are there any scoreboarding implementations that mimic this behavior by buffering more instructions in the read-operands stage?

MitchAlsup

May 7, 2018, 11:45:39 AM
K9, a cancelled AMD chip designed for 5GHz operating frequency in 2006, used
value free reservation stations to save RS area. After the instruction fired
into execution, the register file was read and forwarding performed.
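
Roughly the difference between the two entry formats, as I understand it (a sketch with made-up field widths, not the actual K9 layout):

/* Value-capturing vs. value-free reservation station entries (illustrative
 * field sizes only). The value-free form keeps just the physical register
 * names; operands are read from the register file / forwarding network when
 * the entry fires into execution. */
#include <stdint.h>
#include <stdbool.h>

struct rs_entry_capture {          /* classic Tomasulo-style entry              */
    uint8_t  op;
    uint8_t  tag_a, tag_b;         /* producer tags, valid until value arrives  */
    bool     ready_a, ready_b;
    uint64_t value_a, value_b;     /* captured operand values                   */
    uint8_t  dest_tag;
};

struct rs_entry_value_free {       /* tags-only entry                           */
    uint8_t  op;
    uint8_t  preg_a, preg_b;       /* post-rename physical register names       */
    bool     ready_a, ready_b;     /* set when the producer broadcasts its tag  */
    uint8_t  dest_preg;
};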

thomas....@gmail.com

May 7, 2018, 8:30:52 PM
Am Montag, 7. Mai 2018 17:45:39 UTC+2 schrieb MitchAlsup:

> K9, a cancelled AMD chip designed for 5GHz operating frequency in 2006,

Out of curiosity: Why was it canceled? I would assume it hit the thermal wall?

Regards,

Thomas

infinis...@gmail.com

May 8, 2018, 10:08:15 AM
So only tags from the CDB are propagated to the reservation stations? Did this require more ports on the register file?
What is your opinion on this type of design?
And, like Thomas asks above, why was it cancelled?

MitchAlsup

May 8, 2018, 1:03:01 PM
Power Wall.

MitchAlsup

May 8, 2018, 1:05:05 PM
On Tuesday, May 8, 2018 at 9:08:15 AM UTC-5, infinis...@gmail.com wrote:
> On Monday, May 7, 2018 at 11:45:39 AM UTC-4, MitchAlsup wrote:
> > On Sunday, May 6, 2018 at 8:42:42 PM UTC-5, infinis...@gmail.com wrote:
> > > On Sunday, May 6, 2018 at 2:26:19 PM UTC-4, MitchAlsup wrote:
> > > > On Sunday, May 6, 2018 at 3:32:42 AM UTC-5, infinis...@gmail.com wrote:
> > > > > In Tomasulo the instruction window is the size of the reorder buffer right?
> > > >
> > > > Which is equal to the total number of stations.
> > >
> > > Are there any scoreboarding implementations that mimick this behavior by buffering more instructions in the read operand stage?
> >
> > K9, a cancelled AMD chip designed for 5GHz operating frequency in 2006, used
> > value free reservation stations to save RS area. After the instruction fired
> > into execution, the register file was read and forwarding performed.
>
> So only tags from the CDB are propagated to the reservation stations?
Yes
> Did this require more ports on the register file?
No the Reservation Stations would only fire if ports were available.
> What is your opinion on this type of design?
Value-free RS:: they are OK--saves lots of area, enables larger execution windows
8 gates/clock:: way too fast
> and like thomas asks above why was it cancelled.
Power Consumption

Quadibloc

May 8, 2018, 4:06:09 PM
There is a Wikipedia article on the K9, but I wouldn't be surprised if it
was talking about a different design.

Instead of describing AMD's answer to the Pentium 4, the design described
was ambitious in a completely different way: it was to issue up to
eight instructions simultaneously in the same clock cycle.

A chip that could do that wouldn't need to run at 5 GHz to be fast.

Ivan Godard

May 8, 2018, 4:19:35 PM
Fancy that!

infinis...@gmail.com

May 8, 2018, 4:44:13 PM
On Tuesday, May 8, 2018 at 1:05:05 PM UTC-4, MitchAlsup wrote:
> > Did this require more ports on the register file?
> No the Reservation Stations would only fire if ports were available.

What arbitrated port access? There must be some type of synchronization.

MitchAlsup

May 8, 2018, 6:19:18 PM
On Tuesday, May 8, 2018 at 3:06:09 PM UTC-5, Quadibloc wrote:
> There is a Wikipedia article on the K9, but I wouldn't be surprised if it
> was talking about a different design.
>
> Instead of describing AMD's answer to the Pentium 4, the design described
> was ambitious in a completely different way: it was to issue up to
> eight instructions simultaneously in the same clock cycle.

Err, no::

K9 fetched 8 instructions every other cycle and made 2 branch predictions
associated with 3 next fetch addresses every other cycle.

K9 issued 4 instructions per cycle and took 2 cycles to issue a fetch width.

The major BW limiter was the renamer, not the register file.

> A chip that could do that wouldn't need to run at 5 GHz to be fast.

Was not my choice. Dirk said:: "Frequency is your friend" and we designed
a chip to his exacting specification. Most of the logic was running in
SPICE at 5GHz at the time of cancellation, and a majority of the layout
was done.

infinis...@gmail.com

May 8, 2018, 7:58:56 PM
Also besides arbitrating port access, wouldn't this require more logic with respect to the lifetime of rename registers?

MitchAlsup

May 8, 2018, 10:03:31 PM
Rename register size:: 7-bits
Register size.......:: 84-bits

So for every operand you save out of the reservation station, you get 12 new rename registers as a trade off. (This would have only been 9 new rename entries were the registers not required to support 80-bit arithmetic.)

So, given 4*16 reservation station entries each with 2 operands, not capturing the result data saves 168 (128) flip-flops per entry, at a cost of retaining the 7 bits one already needed to read the register file (and remember these are the post-rename register names). In total we saved 10,752 flip-flops, or about 100K gates (equivalent to about 2.5 64-bit FMAC units).

The cost was another 2 cycles in the pipeline--about 4*2*84 (672) bits plus one more set of 8 rename registers (8*7 = 56) call it 750 bits.
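
For anyone checking the arithmetic, here is the tally (the figures come from the numbers above; the roughly 10 gate-equivalents per flip-flop ratio is inferred from the 10,752-FF / 100K-gate figure, not stated explicitly):

/* Back-of-the-envelope tally of the trade-off described above. */
#include <stdio.h>

int main(void)
{
    int entries   = 4 * 16;              /* reservation station entries       */
    int per_entry = 2 * 84;              /* 2 operands x 84-bit registers     */
    int saved_ff  = entries * per_entry; /* 10,752 flip-flops saved           */

    int added_ff  = 4 * 2 * 84 + 8 * 7;  /* 672 staging bits + 56 tag bits,
                                            "call it 750"                     */

    printf("saved %d FFs (~100K gates), added %d FFs (~7.5K gates)\n",
           saved_ff, added_ff);
    return 0;
}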

Now: question to those attempting to pay attention:: which is bigger 100K or 7.5K ? Second question:: by a little or by a lot?

The real question is why others did not discard the operand capture part of their reservation stations.

Ivan Godard

May 9, 2018, 12:12:02 AM
Perhaps, in those days before branch predictors got better, the added
mispredict penalty was considered more important than the area/power?

infinis...@gmail.com

May 9, 2018, 4:01:53 PM
On Sunday, May 6, 2018 at 2:26:19 PM UTC-4, MitchAlsup wrote:
I've been reviewing Tomasulo and just want to make sure I'm understanding this correctly. The original Tomasulo didn't have a reorder buffer or a history buffer or checkpoint repair so the register file had to have the same number of registers as there are reservation stations? This seems obvious but I just want to make sure I'm thinking about this correctly.

infinis...@gmail.com

May 9, 2018, 4:15:49 PM
I understand the savings implied by getting rid of values in the reservation station. But I was talking about management/control. With reservation stations holding both tags and values, once a value is on the CDB the rename register becomes free again, since the values are buffered in the reservation stations. If the reservation stations don't buffer values, then you would need to track when all instructions dependent on an output have read their operands before you know a rename register is free again... right?

MitchAlsup

May 9, 2018, 4:57:21 PM
No, the register file still had 4 registers, but there were 2+3+3+2 stations
{ADD, MUL, LD, ST}. Here registers were renamed into station numbers.

Once you rename into a reorder buffer (so you can recover) the number of ROB
entries has to be at least as large as the number of RS entries.

MitchAlsup

May 9, 2018, 4:59:21 PM
With value capture stations, you read the register file at issue time.
With value free stations, you read the file at execute time.

The number of ports needed is larger at issue time than at execute time
because you are almost never executing as fast as you can issue (Mill
excluded.)

infinis...@gmail.com

May 9, 2018, 5:20:36 PM
Yeah, I realized this upon reading more about it in a book I own. But I don't understand how the original implementation dealt with exceptions and interrupts. I suppose with an external interrupt you just stop issuing instructions and wait till all pending instructions complete so you have consistency. But let's say you get a divide by zero in the FPU, then what?

infinis...@gmail.com

May 9, 2018, 5:41:23 PM
On Wednesday, May 9, 2018 at 4:59:21 PM UTC-4, MitchAlsup wrote:

> With value capture stations, you read the register file at issue time.
> With value free stations, you read the file at execute time.
>
> The number of ports needed is larger at issue time than at execute time
> because you are almost never executing as fast as you can issue (Mill
> excluded.)

I understand and agree with the first part of what you said, and I can see what you're saying in the second part. But this really doesn't address my concern about knowing when a rename register becomes free to use again. If you have to read the register file at execute time, the lifetime of a rename register no longer ends implicitly when it is written to. So how is this handled?

infinis...@gmail.com

May 9, 2018, 5:55:14 PM
On Wednesday, May 9, 2018 at 5:41:23 PM UTC-4, infinis...@gmail.com wrote:

> I understand and agree with the first part of what you said, and I can see what you're saying in the second part. But this really doesn't address my concern about knowing when a rename register becomes free to use again. If you have to read the register file at execute time the lifetime of an rename register is no longer ends implicitly when it is written to. So how is this handled?

Oh wait I realize what I was misunderstanding. The rename register is still alive until it gets 'retired'. So basically the ROB is involved with its lifetime.

MitchAlsup

May 9, 2018, 6:27:58 PM
If you rename into a physical register file, you don't have this problem.
A physical register file is like a ROB except that it can cover both OoO
state and architectural state at the same time. It also avoids the movement
at retirement time.
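
For the register-lifetime question, the usual rule with a merged physical register file looks something like this (a generic sketch, not a description of K9 and not the CAM/valid-bit scheme discussed elsewhere in the thread): the old physical register for a destination is freed when the younger instruction that remapped it retires.

/* Generic sketch of physical-register freeing in a merged register file:
 * at rename time the previous mapping of the destination is remembered,
 * and that previous physical register goes back on the free list when the
 * renaming instruction retires (no data movement at retirement).
 * Assumes rat[], free_list[], and free_top were initialized so every
 * physical register is either mapped or on the free list. */
enum { NARCH = 16, NPHYS = 64 };

struct inflight {
    int arch_dest;      /* architectural destination register          */
    int new_phys;       /* physical register allocated for the result  */
    int prev_phys;      /* mapping displaced at rename time            */
};

int rat[NARCH];         /* current (future) arch -> phys mapping        */
int free_list[NPHYS];
int free_top;

void rename_inst(struct inflight *in, int arch_dest)
{
    in->arch_dest  = arch_dest;
    in->prev_phys  = rat[arch_dest];          /* remember the old mapping  */
    in->new_phys   = free_list[--free_top];   /* allocate a fresh phys reg */
    rat[arch_dest] = in->new_phys;
}

void retire_inst(const struct inflight *in)
{
    free_list[free_top++] = in->prev_phys;    /* old value can never be needed again */
}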

Anton Ertl

May 10, 2018, 2:43:42 AM
MitchAlsup <Mitch...@aol.com> writes:
>On Wednesday, May 9, 2018 at 3:15:49 PM UTC-5, infinis...@gmail.com wrote:
>> On Tuesday, May 8, 2018 at 10:03:31 PM UTC-4, MitchAlsup wrote:
>> > The real question is why others did not discard the operand capture part
>> > of their reservation stations.

A while ago, I read somewhere that Intel switched to value-free when
they introduced AVX on the Sandy Bridge, because value capture is too
expensive with 256-bit values.

>With value capture stations, you read the register file at issue time.
>With value free stations, you read the file at execute time.
>
>The number of ports needed is larger at issue time than at execute time
>because you are almost never executing as fast as you can issue (Mill
>excluded.)

What about forwarding? Value capture stations can grab the results
from the bus, saving a register read port. Value-free stations
sometimes (how often?) miss that opportunity. Whether the overall
read port requirements are higher or lower is not clear. And of
course, if you want to make use of that to reduce overall read port
requirements, you complicate the logic because you need to arbitrate
the remaining read ports.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Anton Ertl

May 10, 2018, 2:56:54 AM
infinis...@gmail.com writes:
[Tomasulo]
>But I don't understand how the original implementation dealt with exceptions
>and interrupts. I suppose with an external interrupt you just stop issuing
>instructions and wait till all pending instructions complete so you have
>consistency. But let's say you get a divide by zero in the FPU, then what?

That's why you often read about "imprecise interrupts" in connection
with Tomasulo. In particular,
<https://en.wikipedia.org/wiki/Tomasulo_algorithm#Exceptions> says:

|[Programs] that experience "imprecise" exceptions generally cannot
|restart or re-execute, as the system cannot determine the specific
|instruction that took the exception.

infinis...@gmail.com

May 10, 2018, 11:26:49 AM
I understand that, but what I'm saying is: that's the solution, halt the program? Basically what gets me is that the purpose of Tomasulo was supposedly to provide better performance while remaining compatible with the rest of the 360 series. So are imprecise interrupts/exceptions common to the whole series?

infinis...@gmail.com

May 10, 2018, 11:36:03 AM
When you say physical register file, you're referring to a 'merged' register file where both rename and architectural registers live? Wouldn't you still need to keep track of when an instruction needs to retire if you want in-order retire, i.e. precise exceptions? If you're saying no, then why not?

Nick Maclaren

May 10, 2018, 11:47:19 AM
In article <041d2250-3ba7-41f6...@googlegroups.com>,
<infinis...@gmail.com> wrote:
>
>I understand that but what I'm saying is thats the solution halt the
>program? Basically what gets me is that the purpose of Tomasulo was
>supposedly to provide better performance while remaining compatible with
>the rest of the 360 series. So are imprecise interrupts/exceptions
>common to the whole series?

Yes. Models vary as to which exceptions are imprecise, but their
existence is common to the whole series. Even the System/390, though
I vaguely recall that few models of that range actually had any
imprecise exceptions. But it's a heck of a long time ago now!

Actually, you can very often restart and even sometimes reexecute,
but that needs the compiler and run-time system writer to cooperate
to enable it. Attempting it without that CAN be done, but you don't
want to go there, I can assure you!


Regards,
Nick Maclaren.

Anton Ertl

May 10, 2018, 12:15:46 PM
infinis...@gmail.com writes:
>I understand that, but what I'm saying is: that's the solution, halt the
>program? Basically what gets me is that the purpose of Tomasulo was supposedly
>to provide better performance while remaining compatible with the rest of the
>360 series. So are imprecise interrupts/exceptions common to the whole series?

I don't think so.

My take on this is: The 360/91 was built and sold as a supercomputer,
and behaves like one, not like a true S/360 family member. I.e.,
correctness is sacrificed for speed. And I think that's exactly what
the customers wanted (well, except the two customers who used them to
run Cobol programs; they did not get speed, but at least they also did
not get imprecise interrupts, because they did not use the FPU).

Why did IBM make it run the S/360 instructions, then? My guess is
that the main reason is that this was part of the marketing of the
S/360 line, i.e., it sends the message "We cover all bases", from the
cheap Model 20 (which apparently was even less of a proper member than
the 91) to the supercomputer Models 91 and 195. Wikipedia reports
$113k for the Model 30 (and the 20 was certainly cheaper) and
$7M-$12.5M for the 195.

A benefit (but probably not decisive) is that they could use S/360
software on the Model 91; in particular, they did not need a
separate Fortran compiler.

MitchAlsup

May 10, 2018, 12:33:01 PM
On Thursday, May 10, 2018 at 1:43:42 AM UTC-5, Anton Ertl wrote:
> MitchAlsup <?????@aol.com> writes:
> >On Wednesday, May 9, 2018 at 3:15:49 PM UTC-5, infinis...@gmail.com wrote:
> >> On Tuesday, May 8, 2018 at 10:03:31 PM UTC-4, MitchAlsup wrote:
> >> > The real question is why others did not discard the operand capture part
> >> > of their reservation stations.
>
> A while ago, I read somewhere that Intel switched to value-free when
> they introduced AVX on the Sandy Bridge, because value capture is too
> expensive with 256-bit values.
>
Good to know.

> >With value capture stations, you read the register file at issue time.
> >With value free stations, you read the file at execute time.
> >
> >The number of ports needed is larger at issue time than at execute time
> >because you are almost never executing as fast as you can issue (Mill
> >excluded.)
>
> What about forwarding? Value capture stations can grab the results
> from the bus, saving a register read port.

Err, no.
The issue logic is set up to read the register file in case it is not found
by forwarding. So, they read the RF anyway.
In addition, this is also the cycle where the issue logic is determining the
name by which that dynamic value is found (its tag), so one either has to
read the RF early using architectural register number or read the RF one
cycle later using the tag--while at the same time looking to see if the tag
is going to be on the forwarding path.

> Value-free stations
> sometimes (how often?) miss that opportunity.

If the RS entry fires and is scheduled immediately, then it is known
up front that at least one operand is going to arrive by forwarding--
in addition the RS entry knows which one, and can filter only needed
reads to the RF.

MitchAlsup

May 10, 2018, 12:38:12 PM
Yes, but you don't need to move the data, you can do it all with valid bits for the CAM*--which by the way is exactly how one recovers from a branch--
just write in new valid bits to the CAM and the PRF is in the state it was
when the mass of instructions that took the mispredict was issued.

For the "it retires OK" case all one needs is to update the CAM history
retire pointer so that those CAM valid bits cannot be used to recover to
the retired instructions. If one tries to recover to a pointer that is not
in the current execution window--one takes a machine check.

MitchAlsup

May 10, 2018, 12:40:51 PM
On Thursday, May 10, 2018 at 11:15:46 AM UTC-5, Anton Ertl wrote:
> infinis...@gmail.com writes:
> >I understand that, but what I'm saying is: that's the solution, halt the
> >program? Basically what gets me is that the purpose of Tomasulo was supposedly
> >to provide better performance while remaining compatible with the rest of the
> >360 series. So are imprecise interrupts/exceptions common to the whole series?
>
> I don't think so.
>
> My take on this is: The 360/91 was built and sold as a supercomputer,
> and behaves like one, not like a true S/360 family member. I.e.,
> correctness is sacrificed for speed. And I think that's exactly what
> the customers wanted (well, except the two customers who used them to
> run Cobol programs; they did not get speed, but at least they also did
> not get imprecise interrupts, because they did not use the FPU).

Correct. Read Anderson's paper on the partially ordered IU attached to
the Tomasulo FPU.

Nick Maclaren

May 10, 2018, 3:19:33 PM
In article <2018May1...@mips.complang.tuwien.ac.at>,
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>
>My take on this is: The 360/91 was built and sold as a supercomputer,
>and behaves like one, not like a true S/360 family member. I.e.,
>correctness is sacrificed for speed. And I think that's exactly what
>the customers wanted (well, except the two customers who used them to
>run Cobol programs; they did not get speed, but at least they also did
>not get imprecise interrupts, because they did not use the FPU).

Not quite. It was built for specialist number-crunching, but was NOT
a supercomputer or really sold as one. Also, there were quite a few
imprecise interrupts that were not generated by the FPU - 0C5, for one,
and that was pretty basic (addressing exception). Also, imprecise
exceptions did NOT sacrifice any correctness, though I have used
other machines where they did :-(

>Why did IBM make it run the S/360 instructions, then? My guess is
>that the main reason is that this was part of the marketing of the
>S/360 line, i.e., it sends the message "We cover all bases", from the
>cheap Model 20 (which aparently was even less of a proper member than
>the 91) to the supercomputer Models 91 and 195. Wikipedia reports
>$113k for the Model 30 (and the 20 was certainly cheaper) and
>$7M-$12.5M for the 195.

I was told that there was a specific customer that commissioned it
but, by its own rules, could buy only openly marketed products. I
can't imagine who that could have been :-)

>A benefit (but probably not decisive) is that they could use S/360
>software on the Model 91, in particular, they did not needed a
>separate Fortran compiler.

It and the run-time system were heavily hacked for the Model 91. When
Santa Teresa rewrote the latter for VS Fortran, they asked me what all
the 'BCR 1,0' instructions were doing and were they still needed? They
were the Model 91 pipeline drain ones, so no :-)


Regards,
Nick Maclaren.

Quadibloc

May 10, 2018, 5:49:15 PM
An issue of Datamation had articles on the 6600 and the 195 and it said "supercomputers" on the cover.

Remember, this was way before the STAR-100 and the Cray-1, so that was super by the standards then.

MitchAlsup

May 10, 2018, 8:53:45 PM
On Thursday, May 10, 2018 at 4:49:15 PM UTC-5, Quadibloc wrote:
> An issue of Datamation had articles on the 6600 and the 195 and it said "supercomputers" on the cover.
>
> Remember, this way before the STAR-100 and the Cray-I, so that was super by the standards then.

The colloquial definition of supercomputer is the fastest machine of its day/era.

The CDC 6600 certainly qualifies.
The 360/91 does not qualify (though only barely).
The CDC 7600 certainly qualifies.

Anton Ertl

May 11, 2018, 1:44:42 AM
MitchAlsup <Mitch...@aol.com> writes:
>The colloquial definition of supercomputer is the fastest machine of its day/era.

Not in my understanding. No, a computer with a 3.4GHz Pentium 4
Northwood is not a supercomputer, although it is the fastest computer
of our time (if we measure speed in dependent integer additions).

My understanding agrees with the definition of Wikipedia:

|A supercomputer is a computer with a high level of performance
|compared to a general-purpose computer. Performance of a supercomputer
|is measured in floating-point operations per second (FLOPS) instead of
|million instructions per second (MIPS).

And by that definition, the IBM 360/91 does qualify. Its
supercomputer (rather than mainframe) orientation is also demonstrated
by them implementing the decimal instructions only through
interpretation.

Also, by your definition, there cannot be a TOP 500 list of
supercomputers, only a TOP 1. And most of the 360/91s would have been
on the TOP500 if it existed at the time.

Nick Maclaren

May 11, 2018, 3:50:58 AM
In article <8fe2f087-1b5b-41f8...@googlegroups.com>,
Yes, the 195 was. I was talking about the 91, which was a very odd
beast.


Regards,
Nick Maclaren.

Quadibloc

May 11, 2018, 8:52:23 AM
The 360/91 was intended to be a supercomputer, to succeed where STRETCH failed.

Yes, it didn't quite make it; its pipeline didn't perform as well as expected.

And the 360/85 with cache, intended to be a large-scale commercial
general-purpose machine, performed better than expected.

So I can understand that a machine that doesn't outperform a
contemporary large-scale mainframe is hard to call a supercomputer.

Nick Maclaren

May 11, 2018, 9:50:27 AM
>MitchAlsup <Mitch...@aol.com> writes:
>>The colloquial definition of supercomputer is the fastest machine of
>its day/era.
>
>Not in my understanding. No, a computer with a 3.4GHz Pentium 4
>Northwood is not a supercomputer, although it is the fastest computer
>
>My understanding agrees with the definition of Wikipedia:
>
>|A supercomputer is a computer with a high level of performance
>|compared to a general-purpose computer. Performance of a supercomputer
>|is measured in floating-point operations per second (FLOPS) instead of
>|million instructions per second (MIPS).
>
>And by that definition, the IBM 360/91 does qualify. Its
>supercomputer (rather than mainframe) orientation is also demonstrated
>by them implementing the decimal instructions only through
>interpretation.

Er, plenty of mainframes emulated decimal operations, because that
could be done efficiently. All that proves is that it was not
designed for running financial applications. As I said, IBMers told
me that it was a specialist floating-point machine, with some very
odd characteristics added on special request. That doesn't make it
a supercomputer, though it assuredly was NOT a general-purpose machine.

I don't agree with Mitch's definition, but Wikipedia's isn't right
according to the usage of the time. Whether it is the way the term is
used now, I can't say.


Regards,
Nick Maclaren.

infinis...@gmail.com

May 11, 2018, 12:21:01 PM
On Thursday, May 10, 2018 at 12:38:12 PM UTC-4, MitchAlsup wrote:

> > When you say physical register file you're referring to a 'merged' register file where both rename and architectural registers live? Wouldn't you still need to keep track of when an instruction needs to retire if you want in order retire i.e. want precise exceptions? If you're saying no then why not?
>
> Yes, but you don't need to move the data, you can do it all with valid bits for the CAM*--which by the way is exactly how one recovers from a branch--
> just write in new valid bits to the CAM and the PRF is in the state it was
> when the mass of instructions that took the mispredict was issued.
>
> For the "it retires OK" case all one needs is to update the CAM history
> retire pointer so that those CAM valid bits cannot be used to recover to
> the retired instructions. If one tries to recover to a pointer that is not
> in the current execution window--one takes a machine check.

When you say yes, are you saying yes to the merged register file or to needing to keep track of when an instruction retires? I understand that in a merged register file you don't need to move any data around... you just need to change the mappings for the 'arch RAT'. But precise exceptions for an OoO architecture require something to keep track of sequence. History buffers seem to be for in-order architectures only. So all that's left is a ROB or checkpoint repair... your suggestion doesn't seem to be either... so what exactly are you suggesting? A CAM would either get you all versions of an architectural register or a particular version of an arch register. There is no concept of sequence, so how would you know which version of the registers to pick?

MitchAlsup

May 11, 2018, 12:44:42 PM
On Friday, May 11, 2018 at 12:44:42 AM UTC-5, Anton Ertl wrote:
> MitchAlsup <?????@aol.com> writes:
> >The colloquial definition of supercomputer is the fastest machine of its day/era.
>
> Not in my understanding. No, a computer with a 3.4GHz Pentium 4
> Northwood is not a supercomputer, although it is the fastest computer
> of our time (if we measure speed in dependent integer additions).

I disagree. A 3.4 GHz Pentium is usefully fast, but 65536 of them are way
faster. It is the system that is the supercomputer (or not) not a CPU.
>
> My understanding agrees with the definition of Wikipedia:
>
> |A supercomputer is a computer with a high level of performance
> |compared to a general-purpose computer. Performance of a supercomputer
> |is measured in floating-point operations per second (FLOPS) instead of
> |million instructions per second (MIPS).
>
> And by that definition, the IBM 360/91 does qualify. Its
> supercomputer (rather than mainframe) orientation is also demonstrated
> by them implementing the decimal instructions only through
> interpretation.
>
> Also, by your definition, there cannot be a TOP 500 list of
> supercomputers, only a TOP 1. And most of the 360/91s would have been
> on the TOP500 if it existed at the time.

System not CPU--dolt.

MitchAlsup

May 11, 2018, 12:58:46 PM
On Friday, May 11, 2018 at 11:21:01 AM UTC-5, infinis...@gmail.com wrote:
> On Thursday, May 10, 2018 at 12:38:12 PM UTC-4, MitchAlsup wrote:
>
> > > When you say physical register file you're referring to a 'merged' register file where both rename and architectural registers live? Wouldn't you still need to keep track of when an instruction needs to retire if you want in order retire i.e. want precise exceptions? If you're saying no then why not?
> >
> > Yes, but you don't need to move the data, you can do it all with valid bits for the CAM*--which by the way is exactly how one recovers from a branch--
> > just write in new valid bits to the CAM and the PRF is in the state it was
> > when the mass of instructions that took the mispredict was issued.
> >
> > For the "it retires OK" case all one needs is to update the CAM history
> > retire pointer so that those CAM valid bits cannot be used to recover to
> > the retired instructions. If one tries to recover to a pointer that is not
> > in the current execution window--one takes a machine check.
>
> When you say yes are you saying yes to the merged register file or needing to keep track on when an instruction retires? I understand that in a merged register file you don't need to move any data around... you just need to change the mappings for the 'arch rat'. But precise exemptions for an ooo architecture requires something to keep track of sequence. History buffers seem to be for in-order architectures only.

I am NOT talking about a History FILE, I am talking about how one recovers
a physical register file. One does this by reading out the valid bits from
the History of CAM valid bits per issue cycle, and writing those back into
the valid bits of the CAM.

> So all that's left is a ROB or checkpoint repair... your suggestion doesn't seem to be either... so what exactly are you suggesting? A CAM would either get you all versions of an architectural register or a particular version of a arch register. There is no concept of sequence, so how would you know which version of the registers to pick?

A physical register file is indexed by a CAM at issue and indexed by a
decoder at result delivery.

Every issue cycle the valid bits of the CAM of the PRF are loaded into
a new slot in the History Buffer; and the CAM is updated with new mappings
and new valid bits. A physical register is always in one of 3 states::
Architectural, waiting for a result, waiting for retire. The PRN CAM
monitors the result delivery and asserts bits into the History buffer.
The History buffer accumulates whether all issued instructions have
been performed, and this is used to move the retirement pointer.

Each issue cycle, the History buffer supplies PRNs to the available
register list, and each issue cycle, registers are allocated from the
available register list. On a recovery, the History buffer erases
the pending result deliveries, and these PRNs become available for
new instruction issue.

So the history buffer is a PRN-width by ExecutionWindow-deep array of cells.
Each cell contains 2 bits of information and 3 or 4 gates to watch
the various select lines; and an accumulating readout port which provides
the available register list.
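
Roughly, in throwaway Python (a behavioral sketch only--the names, the
tiny sizes, and the list-of-snapshots structure are made up here, not
the gate-level design): snapshot the CAM valid bits each issue cycle,
write them back on recovery, and let a retire pointer fence off the
snapshots that can no longer be recovered to.

    NUM_PRF = 16   # physical registers, kept tiny for readability

    class ValidBitHistory:
        def __init__(self):
            self.cam_valid = [False] * NUM_PRF  # "architecturally visible" bit per PRF entry
            self.snapshots = []                 # saved valid bits, one entry per issue cycle
            self.retire_ptr = 0                 # oldest snapshot still recoverable

        def issue_cycle(self, new_mappings):
            # save the current valid bits, then apply this cycle's new mappings
            self.snapshots.append(list(self.cam_valid))
            tag = len(self.snapshots) - 1
            for old_prn, new_prn in new_mappings:   # new_prn supersedes old_prn
                self.cam_valid[old_prn] = False
                self.cam_valid[new_prn] = True
            return tag                              # kept around for possible recovery

        def recover(self, tag):
            # mispredict/exception: write the saved valid bits back into the CAM
            if tag < self.retire_ptr:
                raise RuntimeError("machine check: recovery past the retire pointer")
            self.cam_valid = list(self.snapshots[tag])
            del self.snapshots[tag + 1:]

        def retire(self, tag):
            # packet retired: its snapshot can no longer be used for recovery
            self.retire_ptr = max(self.retire_ptr, tag + 1)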

I have schematics if you want.

Anton Ertl

May 11, 2018, 1:46:03 PM
MitchAlsup <Mitch...@aol.com> writes:
>On Friday, May 11, 2018 at 12:44:42 AM UTC-5, Anton Ertl wrote:
>> MitchAlsup <?????@aol.com> writes:
>> >The colloquial definition of supercomputer is the fastest machine of its day/era.
>>
>> Not in my understanding. No, a computer with a 3.4GHz Pentium 4
>> Northwood is not a supercomputer, although it is the fastest computer
>> of our time (if we measure speed in dependent integer additions).
>
>I disagree. A 3.4 GHz Pentium is usefully fast, but 65536 of them are way
>faster.

Not if we measure speed in dependent integer additions.

>It is the system that is the supercomputer (or not) not a CPU.

It's the measurement that defines what "fastest" is. And the TOP500,
which intends to rank supercomputers, defines it through the LINPACK
benchmark.

>> Also, by your definition, there cannot be a TOP 500 list of
>> supercomputers, only a TOP 1. And most of the 360/91s would have been
>> on the TOP500 if it existed at the time.
>
>System not CPU--dolt.

The IBM System 360 Model 91 is a system (it already says so in its
name).

As for the Pentium 4, that's why I specified "a computer with a 3.4GHz
Pentium 4 Northwood".

But what does this have to do with the question of whether there can
be only one supercomputer at a time, or 500? Or more?

One other thing that influences the characteristics of supercomputers
is that LINPACK performance is everything. Other criteria are, at
best, secondary.

Cost? Yes, sure, we want the fastest supercomputer we can afford,
but, like for the tallest building, we are prepared to spend funny
amounts of money for the honor to be on top.

Compatibility? Unimportant.

Ease of programming? As long as we can make LINPACK go fast in time
for the next TOP 500, unimportant.

Other applications? Well, let's hope that LINPACK is representative
of these other applications. If that is the case, these are
supercomputer applications, otherwise not.

etc.

Not that I find this good, but the way you evaluate things forms the
territory after a while.

MitchAlsup

May 11, 2018, 7:52:39 PM
On Friday, May 11, 2018 at 12:46:03 PM UTC-5, Anton Ertl wrote:
> MitchAlsup <????@aol.com> writes:
> >On Friday, May 11, 2018 at 12:44:42 AM UTC-5, Anton Ertl wrote:
> >> MitchAlsup <?????@aol.com> writes:
> >> >The colloquial definition of supercomputer is the fastest machine of its day/era.
> >>
> >> Not in my understanding. No, a computer with a 3.4GHz Pentium 4
> >> Northwood is not a supercomputer, although it is the fastest computer
> >> of our time (if we measure speed in dependent integer additions).
> >
> >I disagree. A 3.4 GHz Pentium is usefully fast, but 65536 of them are way
> >faster.
>
> Not if we measure speed in dependent integer additions.
>
> >It is the system that is the supercomputer (or not) not a CPU.
>
> It's the measurement that defines what "fastest" is. And the TOP500,
> which intends to rank supercomputers, defines it through the LINPACK
> benchmark.
>
> >> Also, by your definition, there cannot be a TOP 500 list of
> >> supercomputers, only a TOP 1. And most of the 360/91s would have been
> >> on the TOP500 if it existed at the time.
> >
> >System not CPU--dolt.
>
> The IBM System 360 Model 91 is a system (it already says so in its
> name).
>
> As for the Pentium 4, that's why I specified "a computer with a 3.4GHz
> Pentium 4 Northwood".
>
> But what does this have to do with the question of whether there can
> be only one supercomputer at a time, or 500? Or more?

I am not saying there can only be one supercomputer at a time.
I am saying that a system of 65K fast processors is faster at doing the
things people are willing to pay supercomputer $$$ for than a system with
1 processor.
>
> One other thing that influences the characteristics of supercomputers
> is that LINPACK performance is everything. Other criteria are, at
> best, secondary.

LINPACK is a great big parallelizable floating point benchmark. Nearly
anything that can be described by physics equations can be made to
look and smell remarkably like LINPACK.
>
> Cost? Yes, sure, we want the fastest supercomputer we can afford,
> but, like for the tallest building, we are prepared to spend funny
> amounts of money for the honor to be on top.
>
> Compatibility? Unimportant.
>
> Ease of programming? As long as we can make LINPACK go fast in time
> for the next TOP 500, unimportant.

If you are paying $100M for your supercomputer, you can afford some
numericalists to perform your programming over the course of a few years.

MitchAlsup

May 11, 2018, 7:59:38 PM
On Friday, May 11, 2018 at 12:46:03 PM UTC-5, Anton Ertl wrote:
> MitchAlsup <???@aol.com> writes:
> >On Friday, May 11, 2018 at 12:44:42 AM UTC-5, Anton Ertl wrote:
> >> MitchAlsup <?????@aol.com> writes:
> >> >The colloquial definition of supercomputer is the fastest machine of its day/era.
> >>
> >> Not in my understanding. No, a computer with a 3.4GHz Pentium 4
> >> Northwood is not a supercomputer, although it is the fastest computer
> >> of our time (if we measure speed in dependent integer additions).
> >
> >I disagree. A 3.4 GHz Pentium is usefully fast, but 65536 of them are way
> >faster.
>
> Not if we measure speed in dependent integer additions.


A) Show me a supercomputer application that is measured in any way by serially dependent additions?

B) A parallel prefix adder can perform these serially dependent additions in parallel--while giving you access to all the intermediate sums, also in parallel.

So clearly you are simply obfuscating this thread.
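
To illustrate (B), a tiny sketch of the parallel-prefix idea (plain
Python, not any particular adder circuit): each doubling step combines
partial sums, so all N prefix sums are available after about log2(N)
steps, each of which is fully parallel in hardware.

    def prefix_sums(xs):
        out = list(xs)
        step = 1
        while step < len(out):
            # in hardware, every add within one step happens in parallel
            out = [out[i] + (out[i - step] if i >= step else 0)
                   for i in range(len(out))]
            step *= 2
        return out

    # prefix_sums([1, 2, 3, 4]) -> [1, 3, 6, 10]: every intermediate sum falls out too.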

Nick Maclaren

May 12, 2018, 3:41:39 AM
In article <33d98e8e-9d59-49cc...@googlegroups.com>,
MitchAlsup <Mitch...@aol.com> wrote:
>On Friday, May 11, 2018 at 12:46:03 PM UTC-5, Anton Ertl wrote:
>>
>> One other thing that influences the characteristics of supercomputers
>> is that LINPACK performance is everything. Other criteria are, at
>> best, secondary.
>
>LINPACK is a great big parallelizable floating point benchmark. Nearly
>anything that can be described by physics equations can be made to
>look and smell remarkably like LINPACK.

Er, no. LINPACK is not representative of most applications, because
it has an unusually high operation/access ratio and very good locality.
McCalpin wrote STREAM for a reason, and I favoured latency testing,
because that was more indicative.

>> Cost? Yes, sure, we want the fastest supercomputer we can afford,
>> but, like for the tallest building, we are prepared to spend funny
>> amounts of money for the honor to be on top.
>>
>> Compatibility? Unimportant.
>>
>> Ease of programming? As long as we can make LINPACK go fast in time
>> for the next TOP 500, unimportant.

All of those have some truth in them, but even more falsehood. No,
supercomputing is NOT just what you read in the less clueful media.

>If you are paying $100M for your supercomputer, you can afford some
>numericalists to perform your programming over the course of a few years.

You can't make breakthroughs to order, and a lot of problems have no
known way of solving them highly efficiently. A few are known not to
have such a solution.

>> Other applications? Well, let's hope that LINPACK is representative
>> of these other applications. If that is the case, these are
>> supercomputer applications, otherwise not.

That is twaddle, pure and simple.

>> Not that I find this good, but the way you evaluate things forms the
>> territory after a while.

Now, THAT is true.


Regards,
Nick Maclaren.

Bruce Hoult

May 12, 2018, 7:17:03 AM
A supercomputer is a device for turning a CPU-bound problem into a memory-bound problem.

[the original said "I/O bound" but cache is the new core and RAM is the new disk]

Nick Maclaren

May 12, 2018, 7:37:57 AM
In article <582bfb24-52fa-4d89...@googlegroups.com>,
Bruce Hoult <bruce...@gmail.com> wrote:
>
>A supercomputer is a device for turning a CPU-bound problem into a
>memory-bound problem.
>
>[the original said "I/O bound" but cache is the new core and RAM is the
>new disk]

Yes, indeed. But it's getting increasingly true that "memory-bound"
should be replaced by "communication-bound" :-)


Regards,
Nick Maclaren.

infinis...@gmail.com

May 13, 2018, 8:14:29 PM
On Friday, May 11, 2018 at 12:58:46 PM UTC-4, MitchAlsup wrote:

> > When you say yes are you saying yes to the merged register file or needing to keep track on when an instruction retires? I understand that in a merged register file you don't need to move any data around... you just need to change the mappings for the 'arch rat'. But precise exemptions for an ooo architecture requires something to keep track of sequence. History buffers seem to be for in-order architectures only.
>
> I am NOT talking about a History FILE, I am talking about how one recovers
> a physical register file. One does this by reading out the valid bits from
> the History of CAM valid bits per issue cycle, and writing those back into
> the valid bits of the CAM.
>

Is there a difference between a 'History File' and a history buffer?
OK, I sort of get what you're saying, but I have to ask some questions to clear things up. Before I ask those questions, though: does this work with multi-instruction-per-cycle issue and multi-instruction-per-cycle retire?

> > So all that's left is a ROB or checkpoint repair... your suggestion doesn't seem to be either... so what exactly are you suggesting? A CAM would either get you all versions of an architectural register or a particular version of a arch register. There is no concept of sequence, so how would you know which version of the registers to pick?
>
> A physical register file is indexed by a CAM at issue and indexed by a
> decoder at result delivery.
>

I thought a physical register file was indexed by a LUT at issue? A simple rename table that keeps track of the youngest version of each architectural register (indirection). If it's done by a CAM, what exactly are the contents it's being addressed by? Architectural register and sequence number?

> Every issue cycle the valid bits of the CAM of the PRF are loaded into
> a new slot in the History Buffer; and the CAM is updated with new mappings
> and new valid bits. A physical register is always in one of 3 states::
> Architectural, waiting for a result, waiting for retire.

So each entry in the History buffer is numvalidbits*numberofregisters wide? Also, doesn't this pose a routing nightmare? If there are 128 registers then that's a lot of wires (assuming 2 valid bits per register) that aren't parallel to each other. Isn't there a fourth state, unallocated?

> The PRN CAM
> monitors the result delivery and asseerts bits into the History buffer.
> The History buffer accumulates whether all issued instructions have
> been performed, and this is used to move the retirement pointer.

OK, this is where I'm confused (I'm trying to figure this out as I type, hoping it becomes clearer than when I first thought about it): the 'retirement pointer' indexes into what? The history buffer you mention, or something else? Does the history buffer include any other data besides the valid bits at the time of instruction issue? Because I'm not seeing how one gets the sequence of registers to retire from this.

>
> Each issue cycle, the History buffer supplies PRNs to the available
> register list, and each issue cycle, registers are allocated from the
> available register list. On a recovery, the History buffer erases
> the pending result deliveries, and these PRNs become available for
> new instruction issue.

So the history buffer is somehow responsible for figuring out when a current architectural register becomes superseded by a retiring instruction? Also, what happened to the CAM in all this?

>
> So the history buffer is a PRN-width by ExecutionWindow-Deep of cells.
> Each cell contains 2 bits of information and 3 or 4 gates to watch
> the various select lines; and a accumulating readout port which provides
> the available register list.
>
> I have schematics if you want.

You know what, before I ask more questions, can you send me the schematics? My email is in my posts, I think. Also, if you know of a complete write-up on the web (not necessarily by you), it would be greatly appreciated. I tried, but I'm still confused, since what you describe seems to be able to undo some but not all changes to register state, and I'm not seeing how you decide to go from the waiting-to-retire state to the retired state.

MitchAlsup

May 13, 2018, 9:10:14 PM
On Sunday, May 13, 2018 at 7:14:29 PM UTC-5, infinis...@gmail.com wrote:
> On Friday, May 11, 2018 at 12:58:46 PM UTC-4, MitchAlsup wrote:
>
> > > When you say yes are you saying yes to the merged register file or needing to keep track on when an instruction retires? I understand that in a merged register file you don't need to move any data around... you just need to change the mappings for the 'arch rat'. But precise exemptions for an ooo architecture requires something to keep track of sequence. History buffers seem to be for in-order architectures only.
> >
> > I am NOT talking about a History FILE, I am talking about how one recovers
> > a physical register file. One does this by reading out the valid bits from
> > the History of CAM valid bits per issue cycle, and writing those back into
> > the valid bits of the CAM.
> >
>
> Is there a difference between a 'History File' and a history buffer?

Yes, one is a storage means to recover a logical register file;
the other is a storage means to recover the CAMs of a physical register file.

> Ok I sort of get what you're saying but I have to ask some questions to clear things up. Before I ask those questions though I'll also have to ask does this work with multi-instruction per cycle issue and multi-instruction per cycle retire?

I invented this for a 6-issue, 6*k retire machine. We could actually retire
the whole execution window 6*16 in a single step--if the oldest instruction
was the one preventing new instruction issue. This proved to be overkill,
but it fell out for free the way we implemented the PRF.
>
> > > So all that's left is a ROB or checkpoint repair... your suggestion doesn't seem to be either... so what exactly are you suggesting? A CAM would either get you all versions of an architectural register or a particular version of a arch register. There is no concept of sequence, so how would you know which version of the registers to pick?
> >
> > A physical register file is indexed by a CAM at issue and indexed by a
> > decoder at result delivery.
> >
>
> I thought a physical register file is indexed by a LUT at issue?

MY Physical register file is a completely integrated "do the right thing"
Register file, reorder buffer, RAT, and whatever else was needed to do
wide multi issue, wide recovery, instruction sequencing. It is not a
bunch of components carefully sequenced, it is an integrated package
where all manipulations transpire from the same input wires. For example,
the decoder that writes the result controls the write of the register,
and the same select line controls the history buffer logic, so the HB
understands that that result was exception-free and can be retired.
When all instructions in an issue are "performed", the whole
issue can be retired.

You have to see it from the hardware gate/transistor/wire point of view
rather than how they teach computer architecture in universities these
days.

> A simple rename table that keeps track of the youngest version of each architectural register.(indirection) If its done by a CAM what exactly is the contents its being addressed by? Architectural register and sequence number?

You could index a PRF with a RAT, but instead of a RAT we used a CAM.

Each PRF register has a current architectural register name. Only one
name could be valid at any point in time, but there was always one name
that was valid.

We kept track of this with a valid bit per CAM or per register if you
prefer. The History Buffer was a store of the valid bits in the cam,
but to simplify other things (such as branch recovery) we kept two bits
per register: one holds the valid bits if this instruction issue gets
retired, the other one holds the state if the issue gets recovered.

So there were 16*128*2 bits in the history buffer
>
> > Every issue cycle the valid bits of the CAM of the PRF are loaded into
> > a new slot in the History Buffer; and the CAM is updated with new mappings
> > and new valid bits. A physical register is always in one of 3 states::
> > Architectural, waiting for a result, waiting for retire.
>
> So each entry in the History buffer is numvalidbits*numberofregisters wide? Also doesn't this pose a routing nightmare?

The history buffer was topologically right on top of the CAM bits in the
PRF, so the routing was "brain dead easy"--also note the width of the
history buffer was exactly the width of the PRF--one read per cycle,
one write per cycle (of 2 bits).

> If there are 128 registers then that's alot of wires (assuming 2 valid bits per register) that aren't parallel to each other. Isn't there a fourth state unallocated?

But your problem disappears if they ARE parallel to each other and
topologically adjacent.

> > The PRN CAM
> > monitors the result delivery and asseerts bits into the History buffer.
> > The History buffer accumulates whether all issued instructions have
> > been performed, and this is used to move the retirement pointer.
>
> Ok this is where I'm confused (i'm trying to figure this out as I type this hoping it becomes clearer than when I thought about it), the 'retirement pointer' indexes into what? The history buffer you mention or something else? Does the history buffer include any other data besides the valid bits at time of instruction issue? Because I'm not seeing how one gets sequence of registers to retire from this.
>
> >
> > Each issue cycle, the History buffer supplies PRNs to the available
> > register list, and each issue cycle, registers are allocated from the
> > available register list. On a recovery, the History buffer erases
> > the pending result deliveries, and these PRNs become available for
> > new instruction issue.
>
> So the history buffer is somehow responsible for figuring out when a currently arch register becomes superseded by a retiring instruction?

The valid bits of the CAM are written each cycle into the history buffer.
CAM entries that are not waiting for their result and remain
architecturally visible are suitable for having their valid bits flip from
present to waiting. CAM entries which cannot be seen by issuing instructions
and are not waiting for their result are available for allocation. These
entries are allocated by flipping their valid bit from invisible to waiting.

Prior to an issue, the current valid bits in the CAM and the previous issue
allocation are written into the history buffer for recovery.

As issues are retired, the history buffer entries corresponding to the prior state of the CAM are ORed into the currently available register list.
As an issue is recovered, the other history data is ORed into the available list.
At the same time the entries being removed from the currently available list
are NANDed off the same list.
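
In bit-vector terms (a sketch only--the mask names are invented, and
the real hardware does this with select lines and wired OR/NAND, not
Python integers):

    # 'retired_free'   = registers that become free if the packet retires
    # 'recovered_free' = registers that become free if the packet is recovered
    # all values are bit masks over the PRF

    def on_retire(available, retired_free):
        return available | retired_free          # OR the "it retired" vector in

    def on_recover(available, recovered_free, newly_allocated):
        available |= recovered_free              # OR the "it was recovered" vector in
        available &= ~newly_allocated            # knock off registers allocated this cycle
        return available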

> Also what happened to the CAM in all this?
>
> >
> > So the history buffer is a PRN-width by ExecutionWindow-Deep of cells.
> > Each cell contains 2 bits of information and 3 or 4 gates to watch
> > the various select lines; and a accumulating readout port which provides
> > the available register list.
> >
> > I have schematics if you want.
>
> You know what before I ask more questions can you send me the schematics? My email is in my posts I think. Also if you have a complete write up (not necessarily by you) that you know of on the web it would be greatly appreciated. I tried but I'm still confused since what you describe seems to be able to undo some but not all changes to register state, and I'm not seeing how you decide to go from the waiting to retire state to the retire state.

I don't have enough of your e-mail address in the above.
infinis...@gmail.com
I need the characters that ... stands for.

So find me (not hard) and send me a query on this topic and I can
send you a 140-odd page chapter from a book I am writing. It does require
a knowledge of gate/transistor level design and some elementary routing.

It was designed in an era of 3 metal layers and was designed to simplify
all the routing headaches the RF+RAT+ROB+FW created. It is appropriate for
a 20-gates per cycle design, but not a 16-gates per cycle (or faster)
design. When it was designed it WAS appropriate for a 16-gates per cycle
design.........

EricP

May 15, 2018, 2:51:19 PM
MitchAlsup wrote:
>
> You could index a PRF with a RAT, but instead of a RAT we used a CAM.
>
> Each PRF register has a current architectural register name. Only one
> name could be valid at any point in time, but there was always one name
> that was valid.
>
> We kept track of this with a valid bit per CAM or per register if you
> prefer. The History Buffer was a store of the valid bits in the cam,
> but to simplify other things (such as branch recovery) we kept two bits
> ber register, one held the valid bits if this instruction issue gets
> retired, the other one holds the state if the issue gets recovered.
>
> So there were 16*128*2 bits in the history buffer
>
> <snip>
>
> The valid bits of the CAM are written each cycle into the history buffer.
> CAM entries that are not waiting for their result and remaining arch-
> tecturally visible are suitable for having their valid bits flip from
> present to waiting. CAM entries which cannot be seen by issuing instructions
> and are not waiting for their result are available for allocation. These
> entries are allocated by flipping their valid bit from invisible to waiting..
>
> Prior to an issue, the current valid bits in the CAM and the previous issue
> allocation are written into the history buffer for recovery.
>
> As issue are retired, the history buffer entry corresponding to the prior
> state of the CAM are ORed into the currently available register list.
> As an issue is recovered, the other history data is ORed into the available list.
> At the same time the entries being removed from the currently available list
> are NANDed off the same list.

How would you do branch mispredict recovery,
as soon as it resolved, or at branch instruction commit?
Or a bit of both?

Let's see... we have to recover the register map at that checkpoint
(which is just copying the appropriate history vector back to the CAM),
shut down all FU's with work-in-process (which might have
various complications for multi-cycle multipliers),
sync the operation cancels with the FU's,
inhibit write-back for pending registers and re-enable later,
recover all pending physical registers to a free state,
and then fix the new fetch PC and begin fetching.

Recovery as soon as recognized could eliminate some latency,
but looks like it would be quite complex
(because you have to separate out the in-flight operations and
registers that you want to keep from the ones you want to toss).
Also it's not clear to me that the latency is that bad.

Recovery at commit looks like it would be much simpler
because everything has resolved - all future operations
and uncommitted register states can be reset.

One could do a bit of both by beginning the instruction
buffer fetch as soon as recognized and then holding the
new instruction stream at decode until the branch hits commit.
But then you have to take care of nested IF's, and only
recognize a branch mispredict if it is the next one to execute.

Eric

MitchAlsup

May 15, 2018, 7:04:52 PM
check
> shut down all FU's with work-in-process (which might have
> various complications for multi-cycle multipliers),
You only have to find and kill those beyond the branch being recovered.
> sync the operation cancels with the FU's,
They are synched automatically by the recovery of the reservation station
youngest instruction pointer.
> inhibit write-back for pending registers and re-enable later,
Nope, don't have to do this--but it will take you probably around a week to figure out why exactly. It is very clever why we don't have to do this.
> recover all pending physical registers to a free state,
Done by writing the CAM bits back to where they were.
> and then fix the new fetch PC and begin fetching.
Also, nope, but most processors would.

We performed 2 fetches per cycle, the predicted fetch and the alternate
fetch. The predicted fetch was issued, the alternate fetch was written into
the branch recover buffer. When the branch executes, the branch buffer is
read out. If the branch is recovered, we have the instructions at the
alternate issue point and can both recover and issue these instructions in a
single cycle. Zero cycle branch recovery! This is what made GCC (SPEC88)
run so well.

Try that with a RAT.
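
In skeleton form (illustration only; the structure names are invented,
and the real mechanism is of course buffers and wires, not a dict):

    branch_recovery_buffer = {}   # branch tag -> instructions fetched from the alternate path

    def issue(block):
        pass   # hand a fetched block to decode/issue (stub)

    def recover_register_state(tag):
        pass   # write the CAM valid bits saved at this branch back in (stub)

    def fetch_cycle(branch_tag, predicted_block, alternate_block):
        branch_recovery_buffer[branch_tag] = alternate_block  # park the alternate path
        issue(predicted_block)                                # predicted path goes to issue

    def resolve_branch(branch_tag, mispredicted):
        alt = branch_recovery_buffer.pop(branch_tag)
        if mispredicted:
            recover_register_state(branch_tag)  # recover the PRF mappings
            issue(alt)                          # alternate-path instructions issue the same cycle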
>
> Recovery as soon as recognized could eliminate some latency,

You can't get better than zero, unless you start executing down both paths.

> but looks like it would be quite complex
> (because you have to separate out the in-flight operations and
> registers that you want to keep from the ones you want to toss).
Nope.
> Also its not clear to me that the latency is that bad.
It is not.
>
> Recovery at commit looks like it would be must simpler
> because everything has resolved - all future operations
> and uncommitted register states can be reset.

I will let this one go, for now.
>
> One could do a bit of both by beginning the instruction
> buffer fetch as soon as recognized and then holding the
> new instruction stream at decode until the branch hits commit.
> But then you have to take care of nested IF's, and only
> recognize a branch mispredict if it is the next one to execute.
See above for a much better way.
>
> Eric

Ivan Godard

May 15, 2018, 8:12:31 PM
On 5/15/2018 4:04 PM, MitchAlsup wrote:
> On Tuesday, May 15, 2018 at 1:51:19 PM UTC-5, EricP wrote:

<snip>

>> and then fix the new fetch PC and begin fetching.
> Also, nope, but most processors would.
>
> We performed 2 fetches per cycle, the predicted fetch and the alternate
> fetch. The predicted fetch was issued, the alternate fetch was written into
> the branch recover buffer. When the branch executes, the branch buffer is
> read out. If the branch is recovered, we have the instructions at the
> alternate issue point and can both recover and issue these instructions in a
> single cycle. Zero cycle branch recovery ! This is what made GCC (SPEC88)
> run so well.
>
> Try that with a RAT.

So you are fetching both taken and not-taken targets with each branch?
Don't you hose your icache with untaken path targets?

MitchAlsup

May 15, 2018, 9:35:41 PM
One does not pursue the alternate path on a cache miss.
One does ... pursue the predicted path on a cache miss.

Quadibloc

May 16, 2018, 12:10:41 AM
On Tuesday, May 15, 2018 at 6:12:31 PM UTC-6, Ivan Godard wrote:

> So you are fetching both taken and not-taken targets with each branch?
> Don't you hose your icache with untaken path targets?

Stuff from not-taken branch paths is in your cache? Oh, of course. That's where
Meltdown came from.

John Savard

Anton Ertl

May 16, 2018, 2:09:43 AM
Quadibloc <jsa...@ecn.ab.ca> writes:
>Stuff from not-taken branch paths is in your cache? Oh, of course. That's where
>Meltdown came from.

No, Meltdown came from executing instructions on not-taken paths,
ignoring permissions of accesses during these executions, and
revealing the results through the data-cache side-channel.

I would not be surprised if Meltdown could be made to work even
without the not-taken branch trick, albeit less effectively (you would
have to deal with exceptions all the time).

Stephen Fuld

May 16, 2018, 2:15:15 AM
If both paths are in cache, don't you require an extra I cache port, and
doesn't it take more power to fetch both paths? Even if those are true,
I am not saying it is a bad trade off, just that there is a trade off.




--
- Stephen Fuld
(e-mail address disguised to prevent spam)

infinis...@gmail.com

May 16, 2018, 6:46:31 AM
On Sunday, May 13, 2018 at 9:10:14 PM UTC-4, MitchAlsup wrote:
> > Is there a difference between a 'History File' and a history buffer?
>
> Yes, one is a storage means to recover a logical register file,
> The other is a storage means to recover the CAMs of a physical register file.
>
Is this standard terminology or is this of your own invention?

> > > A physical register file is indexed by a CAM at issue and indexed by a
> > > decoder at result delivery.
> >
> > I thought a physical register file is indexed by a LUT at issue?
>
> MY Physical register file is a completely integrated "do the right thing"
> Register file, reorder buffer, RAT, and whatever else was needed to do
> wide multi issue, wide recovery, instruction sequencing. It is not a
> bunch of components carefully sequenced, it is an integrated package
> where all manipulations transpire from the same input wires. For example
> the decoder that writes the result controls both the write of the
> register, but the same select line controls the history buffer logic
> so the HB understands that that result was exception free and can be
> retired. When all instructions in an issue are "performed" the whole
> issue can be retired.
>
OK, this and the rest of your post clear things up more... you calling it a history buffer had me thinking it performed differently from a ROB, but it performs the same function, just in a different manner. I had initially imagined it was out-of-order completion with some kind of rollback... but now I understand it's a circular buffer just like a ROB; it's what is stored that is different. Oh, and also you do seem to have a separate free register list, right?

> You have to see it from the hardware gate/transistor/wire point of view
> rather than how they teach computer architecture in universities these
> days.
>
Why exactly?

> > A simple rename table that keeps track of the youngest version of each architectural register.(indirection) If its done by a CAM what exactly is the contents its being addressed by? Architectural register and sequence number?
>
> You could index a PRF with a RAT, but instead of a RAT we used a CAM.
>
> Each PRF register has a current architectural register name. Only one
> name could be valid at any point in time, but there was always one name
> that was valid.
>
> We kept track of this with a valid bit per CAM or per register if you
> prefer. The History Buffer was a store of the valid bits in the cam,
> but to simplify other things (such as branch recovery) we kept two bits
> ber register, one held the valid bits if this instruction issue gets
> retired, the other one holds the state if the issue gets recovered.
>
OK, so instead of holding which architectural register this instruction is going to retire to, you hold a bit vector of valid bits. I suppose this does simplify instruction retire for instructions that return more than one result. Which has a smaller memory footprint, your history buffer or a 'value-free' ROB? Also, when an instruction retires, are the contents of the history buffer only setting the valid bit of the newly valid mapping, or are they also making the older arch mapping invalid as well?

> So there were 16*128*2 bits in the history buffer
2 is the number of valid bits, 128 I assume is the number of registers in the PRF... what's the 16?

>
> > If there are 128 registers then that's alot of wires (assuming 2 valid bits per register) that aren't parallel to each other. Isn't there a fourth state unallocated?
>
> But your problem disappears if they ARE parallel to each other and
> topologically adjacent.
>
Wouldn't that just move the problem? If you make all the valid bits in the CAM topologically adjacent and parallel, then the select lines from the CAM need to be wired to them, and that would again be a mess.

> > So the history buffer is somehow responsible for figuring out when a currently arch register becomes superseded by a retiring instruction?
>
> The valid bits of the CAM are written each cycle into the history buffer.
> CAM entries that are not waiting for their result and remaining arch-
> tecturally visible are suitable for having their valid bits flip from
> present to waiting. CAM entries which cannot be seen by issuing instructions
> and are not waiting for their result are available for allocation. These
> entries are allocated by flipping their valid bit from invisible to waiting.

OK, so earlier you said 2 bits per entry, but you also said that stores both the valid bits for successful retirement and for a mispredict. This seems to imply one bit for each case... which is two states. But you clearly state there are three states they could be in: valid, waiting for retirement, and waiting for value. What's going on there? Also, you seem to explain there are more valid bits in the CAM per entry than there are in the history buffer per entry... does this explain my question?

>
> Prior to an issue, the current valid bits in the CAM and the previous issue
> allocation are written into the history buffer for recovery.
>
I'll have to think about this when I get a little more time, but it seems basic enough at a glance.

> > > So the history buffer is a PRN-width by ExecutionWindow-Deep of cells.
> > > Each cell contains 2 bits of information and 3 or 4 gates to watch
> > > the various select lines; and a accumulating readout port which provides
> > > the available register list.
Isn't the execution window the number of physical registers minus the number of arch registers?

> > You know what before I ask more questions can you send me the schematics? My email is in my posts I think. Also if you have a complete write up (not necessarily by you) that you know of on the web it would be greatly appreciated. I tried but I'm still confused since what you describe seems to be able to undo some but not all changes to register state, and I'm not seeing how you decide to go from the waiting to retire state to the retire state.
>
> I don't have enough of your e-mail address in the above.
> infinis...@gmail.com
> I need the characters that ... stands for.
>
> So find me (not hard) and send me a query on this topic and I can
> send you a 140-odd page chapter from a book I am writing. It does require
> a knowledge of gate/transistor level design and some elementary routing.
>
> It was designed in an era of 3 metal layers and was designed to simplify
> all the routing headaches the RF+RAT+ROB+FW created. It is appropriate for
> a 20-gates per cycle design, but not a 16-gates per cycle (or faster)
> design. When it was designed it WAS appropriate for a 16-gates per cycle
> design.........

My email is infinis...@gmail.com. I'd appreciate it if you could send it there... thank you for taking the time and effort.

I googled your screen name and got a LinkedIn page that appears to be yours; your last job was a shader core architect at Samsung, right? Wow, your resume is impressive, you must be proud.



MitchAlsup

May 16, 2018, 12:20:20 PM
Remember this is 1990-1991 and we had not yet run into the power wall.

MitchAlsup

May 16, 2018, 12:35:26 PM
On Wednesday, May 16, 2018 at 5:46:31 AM UTC-5, infinis...@gmail.com wrote:
> On Sunday, May 13, 2018 at 9:10:14 PM UTC-4, MitchAlsup wrote:
> > > Is there a difference between a 'History File' and a history buffer?
> >
> > Yes, one is a storage means to recover a logical register file,
> > The other is a storage means to recover the CAMs of a physical register file.
> >
> Is this standard terminology or is this of your own invention?

Probably mine.
>
> > > > A physical register file is indexed by a CAM at issue and indexed by a
> > > > decoder at result delivery.
> > >
> > > I thought a physical register file is indexed by a LUT at issue?
> >
> > MY Physical register file is a completely integrated "do the right thing"
> > Register file, reorder buffer, RAT, and whatever else was needed to do
> > wide multi issue, wide recovery, instruction sequencing. It is not a
> > bunch of components carefully sequenced, it is an integrated package
> > where all manipulations transpire from the same input wires. For example
> > the decoder that writes the result controls both the write of the
> > register, but the same select line controls the history buffer logic
> > so the HB understands that that result was exception free and can be
> > retired. When all instructions in an issue are "performed" the whole
> > issue can be retired.
> >
> Ok this and the rest of your post clears things up more... you calling it a history buffer had me thinking it performed differently from a ROB but it performs the same function just in a different manner. I had initially imagined it was out of order completion with some kind of roll back... but now I understand its a circular buffer just like a ROB it what is stored that is different. OH and also you do seem to have a separate free register list, right?

It is OoO completion, but only instructions on the predicted path(s) are
in the execution window, plus these instructions can still be rolled back.
But after result delivery, there is no more processing of the instruction
(data movement), just processing of state surrounding the instruction.
>
> > You have to see it from the hardware gate/transistor/wire point of view
> > rather than how they teach computer architecture in universities these
> > days.
> >
> Why exactly?
>
> > > A simple rename table that keeps track of the youngest version of each architectural register.(indirection) If its done by a CAM what exactly is the contents its being addressed by? Architectural register and sequence number?

A hit on the CAM asserts a select line that reads out the register.
Just like an equal-detect in a decoder asserts a select line that
reads out a register. You are just replacing 1/2 of the select line
asserters with CAMs (from decoders).

You are CAMing for the current (valid bits) architectural register.
An Encoder on the select line produces the physical register number
if you need to capture data from the result path.
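
Flattened into pseudo-Python (a sketch; the hardware is a row of
comparators driving select lines, not a loop, and the names are
illustrative only):

    def cam_lookup(arch_reg, cam_tags, cam_valid, prf_data):
        # one select line per PRF entry: asserted on a match against a *valid* tag
        select = [v and (t == arch_reg) for t, v in zip(cam_tags, cam_valid)]
        assert sum(select) == 1, "exactly one valid mapping per architectural register"
        prn = select.index(True)        # the encoder on the select lines
        return prn, prf_data[prn]       # physical register number and the register contents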
> >
> > You could index a PRF with a RAT, but instead of a RAT we used a CAM.
> >
> > Each PRF register has a current architectural register name. Only one
> > name could be valid at any point in time, but there was always one name
> > that was valid.
> >
> > We kept track of this with a valid bit per CAM or per register if you
> > prefer. The History Buffer was a store of the valid bits in the cam,
> > but to simplify other things (such as branch recovery) we kept two bits
> > ber register, one held the valid bits if this instruction issue gets
> > retired, the other one holds the state if the issue gets recovered.
> >
> Ok so instead of holding what architectural register this instruction is going to retire you hold a bit vector of valid bits. I suppose this does simplify instruction retire for instructions that return more than one result. Which has a smaller memory footprint your history buffer or a 'value free' ROB?

History buffer.

> Also when an instruction retires is the contents of the history buffer only setting the valid bit of the newly valid mapping or is it also making the older arch mapping invalid as well?

Valid bits are manipulated at issue (into HB) and at retire (onto Available)
but not at result delivery.

> > So there were 16*128*2 bits in the history buffer
> 2 is the number of valid bits, 128 I assume is the number of registers in the PRF... whats the 16?
>
> >
> > > If there are 128 registers then that's alot of wires (assuming 2 valid bits per register) that aren't parallel to each other. Isn't there a fourth state unallocated?
> >
> > But your problem disappears if they ARE parallel to each other and
> > topologically adjacent.
> >
> Wouldn't that just move the problem? If you make all the valid bits in the CAM topologically adjacent and parallel then the wiring from the select lines from the CAM need to be wired to it and that would again be a mess.

One set of wires heads towards the RF, the other heads towards the HB.
At no point in time are there more than 6 wires wide per bit.
>
> > > So the history buffer is somehow responsible for figuring out when a currently arch register becomes superseded by a retiring instruction?
> >
> > The valid bits of the CAM are written each cycle into the history buffer.
> > CAM entries that are not waiting for their result and remaining arch-
> > tecturally visible are suitable for having their valid bits flip from
> > present to waiting. CAM entries which cannot be seen by issuing instructions
> > and are not waiting for their result are available for allocation. These
> > entries are allocated by flipping their valid bit from invisible to waiting.
>
> Ok so earlier you said 2 bits per entry, but you also said that stores both the valid bits for sucessful retirement and for an mispredict. This seems to imply one bit for each case... which is two states. But you clearly state there are three states they could be in valid, waiting for retirement, and waiting for value whats going on there?

There are 3 states a PRF entry can have (architectural, invisible, waiting).
A bit in the history buffer has to recover the "it retired" mappings or
the "it was recovered" mappings, depending on whether the packet of
instructions retired or was recovered. Different state.

> Also you seem to explain there are more valid bits in the CAM per entry than there are in the history buffer per entry... does this explain my question?

It's the other way around.
> >
> > Prior to an issue, the current valid bits in the CAM and the previous issue
> > allocation are written into the history buffer for recovery.
> >
> I'll have to think about this when I get a little more time, but it seems basic enough at a glance.
>
> > > > So the history buffer is a PRN-width by ExecutionWindow-Deep of cells.
> > > > Each cell contains 2 bits of information and 3 or 4 gates to watch
> > > > the various select lines; and a accumulating readout port which provides
> > > > the available register list.
> Isn't the executionwindow the number of physical registers minus the number of arch registers?
>
> > > You know what before I ask more questions can you send me the schematics? My email is in my posts I think.

Your e-mail cannot possibly be "infinis...@gmail.com"
So I am missing what characters were hidden by ...

> Also if you have a complete write up (not necessarily by you) that you know of on the web it would be greatly appreciated. I tried but I'm still confused since what you describe seems to be able to undo some but not all changes to register state, and I'm not seeing how you decide to go from the waiting to retire state to the retire state.
> >
> > I don't have enough of your e-mail address in the above.
> > infinis...@gmail.com
> > I need the characters that ... stands for.
> >
> > So find me (not hard) and send me a query on this topic and I can
> > send you a 140-odd page chapter from a book I am writing. It does require
> > a knowledge of gate/transistor level design and some elementary routing.
> >
> > It was designed in an era of 3 metal layers and was designed to simplify
> > all the routing headaches the RF+RAT+ROB+FW created. It is appropriate for
> > a 20-gates per cycle design, but not a 16-gates per cycle (or faster)
> > design. When it was designed it WAS appropriate for a 16-gates per cycle
> > design.........
>
> My email is infinis...@gmail.com I'd appreciate it if you could send it there... thank you for taking the time and effort.
>
> I googled your screenname and got a linkedin page that appears to be yours, your last job was a shadercore architect at samsung right? Wow your resume is impressive, you must be proud.

I'm just happy to have survived........

infinis...@gmail.com

May 16, 2018, 2:22:07 PM
On Wednesday, May 16, 2018 at 12:35:26 PM UTC-4, MitchAlsup wrote:

> > Ok this and the rest of your post clears things up more... you calling it a history buffer had me thinking it performed differently from a ROB but it performs the same function just in a different manner. I had initially imagined it was out of order completion with some kind of roll back... but now I understand its a circular buffer just like a ROB it what is stored that is different. OH and also you do seem to have a separate free register list, right?
>
> It is OoO completion, but only instructions on the predicted path(s) are
> in the execution window plus these instructions can still be rolled back.
> But after result delivery, there is no more processing of the instruction
> (Data movement) just processing of state surrounding the instruction.
>
Whoops, I meant OoO retirement with rollback, not completion.

> > > > A simple rename table that keeps track of the youngest version of each architectural register.(indirection) If its done by a CAM what exactly is the contents its being addressed by? Architectural register and sequence number?
>
> A hit on the CAM asserts a select line that reads out the register.
> Just like an equal-detect in a decoder asserts a select line that
> reads out a register. You are just replacing 1/2 of the select line
> asserters with CAMs (from decoders).
>
> You are CAMing for the current (valid bits) architectural register.
> An Encoder on the select line produces the physical register number
> if you need to capture data from the result path.

My post seems to be mangled; this wasn't in my last post.

> > Ok so instead of holding what architectural register this instruction is going to retire you hold a bit vector of valid bits. I suppose this does simplify instruction retire for instructions that return more than one result. Which has a smaller memory footprint your history buffer or a 'value free' ROB?
>
> History buffer.
>
Really? If you have 128 registers then each entry in your history buffer would have 256 bits... I would think a value-free ROB would be cheaper in regard to size.

> > Also when an instruction retires is the contents of the history buffer only setting the valid bit of the newly valid mapping or is it also making the older arch mapping invalid as well?
>
> Valid bits are manipulated at issue (into HB) and at retire (onto Available)
> but not at result delivery.
>
At retire the history buffer is accessed, right? So what I'm asking is: how is the old mapping invalidated? It would seem you would need to use the CAM again in some way or another.

> > > So there were 16*128*2 bits in the history buffer
> > 2 is the number of valid bits, 128 I assume is the number of registers in the PRF... whats the 16?
> >
I'm asking the above again since you didn't reply, and it's confusing me, since I would think the instruction window (my assumption) should be related to the PRF length.

> > >
> > > > If there are 128 registers then that's alot of wires (assuming 2 valid bits per register) that aren't parallel to each other. Isn't there a fourth state unallocated?
> > >
> > > But your problem disappears if they ARE parallel to each other and
> > > topologically adjacent.
> > >
> > Wouldn't that just move the problem? If you make all the valid bits in the CAM topologically adjacent and parallel then the wiring from the select lines from the CAM need to be wired to it and that would again be a mess.
>
> One set of wires heads towards the RF the other heads towards the HB.
> At no point in time are there more than 6 wires wide per bit.
> >
If the valid bits of the CAM are located in the CAM, layout-wise, then you'd have 2-wide x PRF-length bits being routed to a history buffer bus which will be history-buffer-entry-width wide. That would make for a messy route. On the other hand, if you move all the valid bits (layout-wise) out of the CAM and into a row like you suggest, then the wires from the comparators would need to be routed, which would get extremely messy... and if you multi-issue then you'd need more than one comparator set per register.

> > Ok so earlier you said 2 bits per entry, but you also said that stores both the valid bits for sucessful retirement and for an mispredict. This seems to imply one bit for each case... which is two states. But you clearly state there are three states they could be in valid, waiting for retirement, and waiting for value whats going on there?
>
> There are 3 states a PRF entry can have (Architectural, invisible, waiting)
> A bit in the history buffer has to recover the "it retired" mappings or
> the "It was recovered" mappings depending on whether the packet of inst-
> ructions retired or was recovered. Different state.
>
What happened to waiting-to-retire and waiting-for-result...? Those seem to be two distinct states.

>
> You e-mmail cannot possibly be "infinis...@gmail.com"
> So I am missing what characters were hidden by ...
>
> >
> > My email is infinis...@gmail.com I'd appreciate it if you could send it there... thank you for taking the time and effort.
> >
> > I googled your screenname and got a linkedin page that appears to be yours, your last job was a shadercore architect at samsung right? Wow your resume is impressive, you must be proud.
>
> I just happy to have survived........
In my last post the email address seems to show up correctly on my desktop but not on my tablet. So you can try your desktop if you want; otherwise here it is again in a different format.
infinisearch99 at gmail dot com

Stephen Fuld

May 16, 2018, 4:56:10 PM
OK, I didn't realize that.

Thinking about it a little more, given that today's branch predictors
give you not only the predicted direction but also a measure of the
confidence of that prediction, would it make sense to fetch both paths
only on branches where the confidence in the prediction is low? This
would reduce average power compared with fetching both paths all the
time, but might still help performance enough to be worth the trouble.
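
Something like this, in sketch form (the threshold and names are pure
invention, just to make the idea concrete):

    CONFIDENCE_THRESHOLD = 0.9   # invented number; would come from the predictor's counters

    def choose_fetch(prediction_confidence):
        # spend the second fetch (and its power) only when the predictor is unsure
        if prediction_confidence < CONFIDENCE_THRESHOLD:
            return "fetch both paths"
        return "fetch predicted path only"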

already...@yahoo.com

May 17, 2018, 4:30:51 AM
On Wednesday, May 16, 2018 at 9:09:43 AM UTC+3, Anton Ertl wrote:
>
> I would not be surprised if Meltdown could be made to work even
> without the not-taken branch trick, albeit less effectively (you would
> have to deal with exceptions all the time).
>

It sounds like you didn't read the Meltdown paper.
The possibility of an exploit via branch misprediction is just mentioned at the beginning.
The bulk of the paper is dedicated to two POCs that don't use branch misprediction at all.

MitchAlsup

May 17, 2018, 2:26:22 PM
A) I'm not sure fetching both ways would be power efficient enough with
today's high accuracy predictors.

B) I'm not sure the pipeline depths that enabled all this could be made
in today's technology (wire delay > gate delay)

C) for doing the HP FP vector-like codes (MATRIX300) there are better
ways to insert instructions into the execution window. This leaves only
the goal of HP spaghetti code--which is predictor accuracy limited.

MitchAlsup

May 17, 2018, 2:33:35 PM
On Wednesday, May 16, 2018 at 1:22:07 PM UTC-5, infinis...@gmail.com wrote:
> On Wednesday, May 16, 2018 at 12:35:26 PM UTC-4, MitchAlsup wrote:
>
> > > Ok this and the rest of your post clears things up more... you calling it a history buffer had me thinking it performed differently from a ROB but it performs the same function just in a different manner. I had initially imagined it was out of order completion with some kind of roll back... but now I understand its a circular buffer just like a ROB it what is stored that is different. OH and also you do seem to have a separate free register list, right?
> >
> > It is OoO completion, but only instructions on the predicted path(s) are
> > in the execution window plus these instructions can still be rolled back.
> > But after result delivery, there is no more processing of the instruction
> > (Data movement) just processing of state surrounding the instruction.
> >
> Whoop I'm meant OoO retirement with rollback not completion.
>
> > > > > A simple rename table that keeps track of the youngest version of each architectural register.(indirection) If its done by a CAM what exactly is the contents its being addressed by? Architectural register and sequence number?
> >
> > A hit on the CAM asserts a select line that reads out the register.
> > Just like an equal-detect in a decoder asserts a select line that
> > reads out a register. You are just replacing 1/2 of the select line
> > asserters with CAMs (from decoders).
> >
> > You are CAMing for the current (valid bits) architectural register.
> > An Encoder on the select line produces the physical register number
> > if you need to capture data from the result path.
>
> My post seems to be mangled this wasn't in my last post.
>
> > > Ok so instead of holding what architectural register this instruction is going to retire you hold a bit vector of valid bits. I suppose this does simplify instruction retire for instructions that return more than one result. Which has a smaller memory footprint your history buffer or a 'value free' ROB?
> >
> > History buffer.
> >
> Really if you have 128 registers then each entry in your history buffer would have 256bits... I would think a value free ROB would be cheaper in regards to size.

Can you recover it in zero cycles of insert delay?
>
> > > Also when an instruction retires is the contents of the history buffer only setting the valid bit of the newly valid mapping or is it also making the older arch mapping invalid as well?

The mapping changes are only done in the CAM.
> >
> > Valid bits are manipulated at issue (into HB) and at retire (onto Available)
> > but not at result delivery.
> >
> At retire the history buffer is accessed right? So what I'm asking is how is the old mapping invalidated? It would seem you would need to use the CAM again in some way or the other.

Retire means the instruction packet survived the pipeline, and any mapping
changes are now architectural. That is you don't have to do anything but
put invisible registers on the available list.
>
> > > > So there were 16*128*2 bits in the history buffer
> > > 2 is the number of valid bits, 128 I assume is the number of registers in the PRF... whats the 16?
> > >
> I'm asking the above again since you didn't reply and its confusing me since I would think the instruction window (my assumption) should be related to the PRF length.
>
> > > >
> > > > > If there are 128 registers then that's alot of wires (assuming 2 valid bits per register) that aren't parallel to each other. Isn't there a fourth state unallocated?
> > > >
> > > > But your problem disappears if they ARE parallel to each other and
> > > > topologically adjacent.
> > > >
> > > Wouldn't that just move the problem? If you make all the valid bits in the CAM topologically adjacent and parallel then the wiring from the select lines from the CAM need to be wired to it and that would again be a mess.
> >
> > One set of wires heads towards the RF the other heads towards the HB.
> > At no point in time are there more than 6 wires wide per bit.
> > >
> If the valid bits of the CAM are located in the CAM layout wise then you'd have 2 wide x PRFlength length bits being routed to a history buffer bus which will be historybufferentrywidth wide.

2 bits wide per PRF--each PRF is 6 wires wide (select lines) so there is
plenty of routing area--and the route is straight.

> That would make for a messy route. On the otherhand if you move all the valid bits layout wide in the CAM out of the CAM and in a row like you suggest then you'd need to route the wires of the comparators would get extremely messy... and if you multi issue then you'd need more than one comparator set per register.

If you move the valid bits out of the CAM you need to route the select
lines to the valid bits so they are updated properly.
>
> > > Ok so earlier you said 2 bits per entry, but you also said that stores both the valid bits for sucessful retirement and for an mispredict. This seems to imply one bit for each case... which is two states. But you clearly state there are three states they could be in valid, waiting for retirement, and waiting for value whats going on there?
> >
> > There are 3 states a PRF entry can have (Architectural, invisible, waiting)
> > A bit in the history buffer has to recover the "it retired" mappings or
> > the "It was recovered" mappings depending on whether the packet of inst-
> > ructions retired or was recovered. Different state.
> >
> What happened to waiting to retire and waiting for result...? those seem to be two distinct states.

Instructions wait for results.
Packets wait for retire (all issued instructions have quit waiting and none
are exceptional).

infinis...@gmail.com

unread,
May 17, 2018, 6:58:47 PM5/17/18
to
On Thursday, May 17, 2018 at 2:33:35 PM UTC-4, MitchAlsup wrote:

> The mapping changes are only done in the CAM.
> > >
> > > Valid bits are manipulated at issue (into HB) and at retire (onto Available)
> > > but not at result delivery.
> > >
> > At retire the history buffer is accessed right? So what I'm asking is how is the old mapping invalidated? It would seem you would need to use the CAM again in some way or the other.
>
> Retire means the instruction packet survived the pipeline, and any mapping
> changes are now architectural. That is you don't have to do anything but
> put invisible registers on the available list.

Ok let me rephrase this. In cycle N there are n valid bits set in the CAM where n is the number of architectural registers. In cycle N one instruction was retired so to maintain n valid bits being set in cycle N+1 one valid bit has to be unset and one valid bit has to be set. What I'm asking is how to choose which CAM entry's valid bit to unset.

In addition the CAM contains the arch register file state that reflects the oldest instructions in the instruction window. So if you sample the CAM at issue time you're not necessarily sampling the arch register state for the instruction in the previous issue. So how do you get a newly issued instruction to use the youngest register values as sources? It seems to me you would still need something else besides a CAM.

> >
> > > > > So there were 16*128*2 bits in the history buffer
> > > > 2 is the number of valid bits, 128 I assume is the number of registers in the PRF... whats the 16?
> > > >
> > I'm asking the above again since you didn't reply and its confusing me since I would think the instruction window (my assumption) should be related to the PRF length.
> >
> > > > >
> > > > > > If there are 128 registers then that's alot of wires (assuming 2 valid bits per register) that aren't parallel to each other. Isn't there a fourth state unallocated?
> > > > >
> > > > > But your problem disappears if they ARE parallel to each other and
> > > > > topologically adjacent.
> > > > >
> > > > Wouldn't that just move the problem? If you make all the valid bits in the CAM topologically adjacent and parallel then the wiring from the select lines from the CAM need to be wired to it and that would again be a mess.
> > >
> > > One set of wires heads towards the RF the other heads towards the HB.
> > > At no point in time are there more than 6 wires wide per bit.
> > > >
> > If the valid bits of the CAM are located in the CAM layout wise then you'd have 2 wide x PRFlength length bits being routed to a history buffer bus which will be historybufferentrywidth wide.
>
> 2 bits wide per PRF--each PRF is 6 wires wide (select lines) so there is
> plenty or routing area--and the route is straight.
>
> > That would make for a messy route. On the otherhand if you move all the valid bits layout wide in the CAM out of the CAM and in a row like you suggest then you'd need to route the wires of the comparators would get extremely messy... and if you multi issue then you'd need more than one comparator set per register.
>
> If you move the valid bits out of the CAM you need to route the select
> lines to the valid bits so they are updated properly.
> >

I'll read the chapters you emailed me before I respond to this.

> > What happened to waiting to retire and waiting for result...? those seem to be two distinct states.
>
> Instructions wait for result
> Packets wait for retire (all issued instruction have quit waiting and none
> are exceptional.)
> >

I thought you said this was all in your PRF so what are these packets you speak of? You said your PRF took the place of a ROB... so if everything is in the PRF then each entry can be unallocated, waiting for result, waiting for retirement, and arch register. So now what are you talking about with Packets wait for retire?

MitchAlsup

unread,
May 17, 2018, 8:18:41 PM5/17/18
to
On Thursday, May 17, 2018 at 5:58:47 PM UTC-5, infinis...@gmail.com wrote:
> On Thursday, May 17, 2018 at 2:33:35 PM UTC-4, MitchAlsup wrote:
>
> > The mapping changes are only done in the CAM.
> > > >
> > > > Valid bits are manipulated at issue (into HB) and at retire (onto Available)
> > > > but not at result delivery.
> > > >
> > > At retire the history buffer is accessed right? So what I'm asking is how is the old mapping invalidated? It would seem you would need to use the CAM again in some way or the other.
> >
> > Retire means the instruction packet survived the pipeline, and any mapping
> > changes are now architectural. That is you don't have to do anything but
> > put invisible registers on the available list.
>
> Ok let me rephrase this. In cycle N there are n valid bits set in the CAM where n is the number of architectural registers. In cycle N one instruction was retired so to maintain n valid bits being set in cycle N+1 one valid bit has to be unset and one valid bit has to be set. What I'm asking is how to choose which CAM entry's valid bit to unset.

In cycle N-K there were 5 instructions issued. In cycle N, in order to retire
any of these instructions one has to retire all 5 of them. There is not a
mechanism to separate instructions after they have been packetized--except
by recovering to the previous issue and then single stepping through the
5 instructions read out of the normal cache, packetizing one instruction
per packet.

If you put m instructions into the execution window, and you are considering
retirement of any one of them, you must be considering retirement of all
m of them.

Thus, all of the CAM+HB+Available+few-others are performed in sets of
instructions, not on instructions individually.

>
> In addition the CAM contains the the arch register file state that reflects the oldest instructions in the instruction window.

CAM contains the architectural state appropriate for instructions being
issued in the current clock cycle. HB can be used to recover the CAM and
back up the execution window (and available list).

> So if you sample the CAM at issue time you're not necessarily sampling the arch register state for the instruction in the previous issue. So how do you get a newly issued instruction to use youngest register value as sources? It seems to me you would still need something else besides a CAM.

The CAM only holds the youngest mappings. The HB contains what the CAM used
to contain.
>
> > >
> > > > > > So there were 16*128*2 bits in the history buffer
> > > > > 2 is the number of valid bits, 128 I assume is the number of registers in the PRF... whats the 16?
> > > > >
> > > I'm asking the above again since you didn't reply and its confusing me since I would think the instruction window (my assumption) should be related to the PRF length.
> > >
> > > > > >
> > > > > > > If there are 128 registers then that's alot of wires (assuming 2 valid bits per register) that aren't parallel to each other. Isn't there a fourth state unallocated?
> > > > > >
> > > > > > But your problem disappears if they ARE parallel to each other and
> > > > > > topologically adjacent.
> > > > > >
> > > > > Wouldn't that just move the problem? If you make all the valid bits in the CAM topologically adjacent and parallel then the wiring from the select lines from the CAM need to be wired to it and that would again be a mess.
> > > >
> > > > One set of wires heads towards the RF the other heads towards the HB.
> > > > At no point in time are there more than 6 wires wide per bit.
> > > > >
> > > If the valid bits of the CAM are located in the CAM layout wise then you'd have 2 wide x PRFlength length bits being routed to a history buffer bus which will be historybufferentrywidth wide.
> >
> > 2 bits wide per PRF--each PRF is 6 wires wide (select lines) so there is
> > plenty or routing area--and the route is straight.
> >
> > > That would make for a messy route. On the otherhand if you move all the valid bits layout wide in the CAM out of the CAM and in a row like you suggest then you'd need to route the wires of the comparators would get extremely messy... and if you multi issue then you'd need more than one comparator set per register.
> >
> > If you move the valid bits out of the CAM you need to route the select
> > lines to the valid bits so they are updated properly.
> > >
>
> I'll read the chapters you emailed me before I respond to this.
>
> > > What happened to waiting to retire and waiting for result...? those seem to be two distinct states.
> >
> > Instructions wait for result
> > Packets wait for retire (all issued instruction have quit waiting and none
> > are exceptional.)
> > >
>
> I thought you said this was all in your PRF so what are these packets you speak of?

Bundles of instructions, pre-routed to the appropriate function unit,
with some pre-decoded information attached, and with both sequential and
taken fetch addresses manifest.

> You said your PRF took the place of a ROB...

No I said the PRF includes all of the typical mechanisms of the::
RF+ROB+RAT+a few others.

>so if everything is in the PRF then each entry can be unallocated, waiting for result, waiting for retirement, and arch register. So now what are you talking about with Packets wait for retire?

Packets are the organization of instructions in the instruction (n.e.
packet) cache. Packets are also the organization of the master sequencer:
either all of the instructions in a packet retire, or we back the machine
up and reissue under a different prediction.

The execution window plays with packets not instructions.
Packets amalgamate a number of instructions and simplify sequencing thereof.
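
To make the packet-granular bookkeeping concrete, here is a rough behavioral
C sketch of retire versus recover. All of the names here (PACKET_WIDTH,
hb_entry, push_available, cam_set_valid) are assumptions made up for the
sketch, not the actual 88120 structures:

    #define PACKET_WIDTH 6                   /* assumed instructions per packet */

    extern void push_available(unsigned prn);       /* hypothetical helpers */
    extern void cam_set_valid(unsigned prn, int v);

    struct hb_entry {                        /* one history-buffer entry per packet */
        int      writes [PACKET_WIDTH];      /* does slot n write a register?       */
        unsigned old_prn[PACKET_WIDTH];      /* mapping the write made invisible    */
        unsigned new_prn[PACKET_WIDTH];      /* mapping the write created           */
    };

    /* The whole packet survived the pipeline: the displaced (now invisible) */
    /* registers can never be named again, so they go on the available list. */
    void retire_packet(const struct hb_entry *e)
    {
        for (int n = 0; n < PACKET_WIDTH; n++)
            if (e->writes[n])
                push_available(e->old_prn[n]);
    }

    /* The whole packet is being backed out: restore the old mappings in the */
    /* CAM and hand the newly allocated registers back to the available list.*/
    void recover_packet(const struct hb_entry *e)
    {
        for (int n = PACKET_WIDTH - 1; n >= 0; n--)   /* youngest slot first */
            if (e->writes[n]) {
                cam_set_valid(e->new_prn[n], 0);
                cam_set_valid(e->old_prn[n], 1);
                push_available(e->new_prn[n]);
            }
    }

Either way one history-buffer entry is walked as a unit, which is why the
machine retires or recovers whole packets rather than individual instructions.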

Ivan Godard

unread,
May 17, 2018, 8:58:36 PM5/17/18
to
In short, your engine is a VLIW with a NOP-free encoding and variable
but unified latency. The unified means no ROB, and the variable means
you don't need special means to accommodate misses, just stall
everything. One could describe it as a MIMD wavefront machine. Is that a
fair characterization?

Compared to a Mill it needs to be able to wait for results, and
mixed-latency packets slow the contained fast ops (whereas Mill has the
ROB in the compiler). A smart issuer that tried to bundle for uniform
latency might have avoided that last - did you have something like that?


MitchAlsup

unread,
May 17, 2018, 10:05:24 PM5/17/18
to
At a high level it is very VLIW-ish--packets were the unit of issue
and retire.

The level presented to the compiler is std RISC.

> Compared to a Mill it needs to be able to wait for results, and
> mixed-latency packets slow the contained fast ops (whereas Mill has the
> ROB in the compiler). A smart issuer that tried to bundle for uniform
> latency might have avoided that last - did you have something like that?

We actually built packets after execution.

Running out of the combined cache we issued 1 IPC--we also did this
in order to find instructions that caused faults (we called this
replay). This mechanism allowed the main sequencer to only have to
know about packets and not about instructions.

We could run out of the instruction cache at 1-IPC. After we saw the
control flow, we built packets on the predicted path. So the branch
predictor predicted agree-disagree rather than take-don't-take. This
bought a couple percent (88%->90%) in accuracy:: not as much as modern
predictors, but pretty ok in 1990.

Running out of the packet cache we could take a branch every cycle,
we could recover a branch and issue the alternate path every cycle,
and we could retire as many packets as were retireable (unnecessary),
and we could fire as many instructions per cycle as we had reservation
stations (6).

Matrix 300 ran at 5.95 IPC for 2B cycles, and then dropped to 1.6 IPC
when the TLB was taking a miss every 4th cycle. Changing from a 64 entry
FA TLB to a 256 entry direct mapped TLB put this loop back at 5.90 IPC.
At 4B instructions, the TLB started taking a miss every cycle, and
perf dropped to about 1.0 IPC for "a long time". This was on top of
std cache misses on big FP data sets of that era.

We could run the first loop of M300 at 5.95 IPC with every 4th packet
taking a miss in the data cache--the execution window was sized and the
main memory latency was sized to enable this kind of perf.

Our simulations included DRAM refresh in the latency of DRAM access,
but we performed open-page access in 5 cycles as seen on the pins of
the processor (50 ns). Wrong page access was 170 ns, closed page
access was 80ns. The DRAM controller had 2 pending reads and 1 pending
write from which to choose its next access. There were 4 or 16 banks
of DRAM.

We also tried out a "paranoid" memory model, where the CPU did not
release the "write buffer" until a DRAM ACK came back acknowledging
that the data arrived without error. This cost 2%-ish on banking
applications--but cost 12% on Matrix300 applications.

infinis...@gmail.com

unread,
May 19, 2018, 4:01:50 PM5/19/18
to
On Thursday, May 17, 2018 at 8:18:41 PM UTC-4, MitchAlsup wrote:

> > Ok let me rephrase this. In cycle N there are n valid bits set in the CAM where n is the number of architectural registers. In cycle N one instruction was retired so to maintain n valid bits being set in cycle N+1 one valid bit has to be unset and one valid bit has to be set. What I'm asking is how to choose which CAM entry's valid bit to unset.
>
> In cycle N-K there were 5 instructions issued. In cycle N in order to retire
> any of these instructions one has to retire all 5 of them. There is not a
> mechanism to separate instructions after they have been packetized--except
> by recovering to the previous issue ans then single stepping through the
> 5 instructions read out of the normal cache, and packetized one instruction
> per packet.
>
> If you put m instructions into the execution window, and you are considering
> retirement of any one of them, you must be considering retirement of all
> m of them.
>
> Thus, all of the CAM+HB+Available+few-others are performed in sets of
> instructions, not on instructions individually.
>
> >
> > In addition the CAM contains the the arch register file state that reflects the oldest instructions in the instruction window.
>
> CAM contains the architectural state appropriate for instructions being
> issued in the current clock cycle. HB can be used to recover the CAM and
> back up the execution window (and available list).
>
Ok I don't understand how this works, and to me you seem to contradict what you said earlier. The CAM contains all the rename mappings right? But only a subset of it is considered retired and that is signified by a 'valid bit'. So the only thing that would be addressable by content (arch register number) plus valid bit (which you said is done at issue time) would be the results of the oldest instructions if they are retired in order. That would mean the youngest state (register context that is the result of the previous aka previously youngest instruction issued) needs to be kept track of as well. Is this done with another valid bit? In addition I still don't see how you are unasserting valid bits to get rid of the old arch mapping during retirement.

MitchAlsup

unread,
May 19, 2018, 6:50:24 PM5/19/18
to
On Saturday, May 19, 2018 at 3:01:50 PM UTC-5, infinis...@gmail.com wrote:
> On Thursday, May 17, 2018 at 8:18:41 PM UTC-4, MitchAlsup wrote:
>
> > > Ok let me rephrase this. In cycle N there are n valid bits set in the CAM where n is the number of architectural registers. In cycle N one instruction was retired so to maintain n valid bits being set in cycle N+1 one valid bit has to be unset and one valid bit has to be set. What I'm asking is how to choose which CAM entry's valid bit to unset.
> >
> > In cycle N-K there were 5 instructions issued. In cycle N in order to retire
> > any of these instructions one has to retire all 5 of them. There is not a
> > mechanism to separate instructions after they have been packetized--except
> > by recovering to the previous issue ans then single stepping through the
> > 5 instructions read out of the normal cache, and packetized one instruction
> > per packet.
> >
> > If you put m instructions into the execution window, and you are considering
> > retirement of any one of them, you must be considering retirement of all
> > m of them.
> >
> > Thus, all of the CAM+HB+Available+few-others are performed in sets of
> > instructions, not on instructions individually.
> >
> > >
> > > In addition the CAM contains the the arch register file state that reflects the oldest instructions in the instruction window.
> >
> > CAM contains the architectural state appropriate for instructions being
> > issued in the current clock cycle. HB can be used to recover the CAM and
> > back up the execution window (and available list).
> >
> Ok I don't understand how this works, and to me you seem to contradict what you said earlier. The CAM contains all the rename mappings right?

In UNARY form. If the CAM[k].reg == inst.reg && CAM[k].valid then assert
select line k. The CAM will only hit on architecturally visible registers.

That means if the machine drains to empty, this register will still be
taking hits on that logical register bit-pattern.

> But only a subset of it is considered retired and that is signified by a 'valid bit'.

You seem lost at this point. The CAM->PRF ports read the architecturally
visible state. The DEC->PRF ports only write completed results (visible,
invisible; but no longer waiting).

All the other <fun> stuff is performed in the history buffer, with some help
from the instruction encoding in the packet <format>.

> So the only thing that would be addressable by content (arch register number) plus valid bit (which you said is done at issue time) would be the results of the oldest instructions if they are retired in order.

Has nothing to do with being retired. It has to do with either not being
overwritten (and thereby staying in one PRF entry) OR being overwritten.
Here, one of 2 things happens:: a) if the PRN is waiting, then
the physical register number (k) is provided so the reservation station
knows which result to watch for, or b) if PRN is not waiting, then one
gets the PRN data via the CAM match. In the 88120 we implemented a third
state where the data was arriving as the CAM is matching. In this case
the PRN is read out, but discarded at forwarding into the RS.

> That would mean the youngest state (register context that is the result of the previous aka previously youngest instruction issued) needs to be kept track of as well.

Yes, one has to track all registers all the time under all conditions.

> Is this done with another valid bit?

History buffer.

> In addition I still don't see how you are unasserting valid bits to get rid of the old arch mapping during retirement.

The CAM is used twice per cycle. In the first 1/2 cycle the CAM is used to
read registers; in the second 1/2 cycle the written register names are
asserted into the CAM and the matching older entry deasserts its valid bit,
while at the same time the unary available list is used to write new CAM
entries from the same written register names that are invalidating the
older entries of that name.

At this point in time, 3-input XOR gates were de rigueur in multiplier
trees, a 5-bit architectural CAM is one 3-input XOR gate, one 2-input XOR
1-input AND gate. These 2 bits are then amalgamated in a NOR gate to yield
a unary select line. This select line can be used to read registers or
invalidate valid bits. Thus the CAM is 2 gates of logical delay, asserting
the logical register names into the CAM was 2.5 gates of delay. So we are
4.5 gates of delay in at the point the CAM knows what to do. As far as
reading goes, the CAM is latched and the read of the PRN happens in the second 1/2 of the cycle (writes happen in the first 1/2 of the cycle).
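
As a behavioral C sketch of those two half-cycle uses of the CAM (cam[],
NUM_PRN, cam_lookup, cam_write are names invented for illustration, and the
loops stand in for compares that real hardware does in parallel):

    #define NUM_PRN 128                     /* assumed number of physical registers */

    struct cam_entry { unsigned areg; int valid; } cam[NUM_PRN];

    /* Source lookup: a hit on an architecturally visible entry asserts   */
    /* select line k; an encoder turns that into the physical register.   */
    int cam_lookup(unsigned areg)
    {
        for (int k = 0; k < NUM_PRN; k++)
            if (cam[k].valid && cam[k].areg == areg)
                return k;                   /* unary select -> encoded PRN */
        return -1;                          /* cannot happen once the ARF is seeded */
    }

    /* Destination write: the architectural name is asserted into the CAM, */
    /* the older entry of the same name drops its valid bit, and the new   */
    /* physical register (taken from the available list) picks the name up.*/
    int cam_write(unsigned areg, unsigned new_prn)
    {
        int old_prn = cam_lookup(areg);     /* entry being made invisible  */
        if (old_prn >= 0)
            cam[old_prn].valid = 0;
        cam[new_prn].areg  = areg;
        cam[new_prn].valid = 1;
        return old_prn;                     /* recorded in the history buffer */
    }

The value handed back by cam_write is what the history buffer has to remember:
it is freed if the packet retires, or made valid again if the packet is
recovered.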


EricP

unread,
May 20, 2018, 7:52:29 PM5/20/18
to
These patents have inventors, dates, topic and company
that suggest they might be for the unreleased M88120.

The first appears to be the CAM mechanism Mitch is talking about,
dealing with register allocation, tracking, and freeing.
It works with register sets (you just have to slog through it.)

A data processor having a logical register content-addressable memory
Michael C. Shebanow, Mitchell Alsup, 1991
https://patents.google.com/patent/EP0514763A2

The retire/backtrack mechanism appears related:

Data processor for performing simultaneous instruction
retirement and backtracking
Michael C. Shebanow, Mitchell Alsup, 1991
https://patents.google.com/patent/US5355457A

Eric




Bruce Hoult

unread,
May 21, 2018, 8:18:04 AM5/21/18
to
I *love* patents from high performance processors from the early 90s (or before). All public domain now.

Robert Finch

unread,
Jul 18, 2018, 11:41:01 PM7/18/18
to
I had fun with this topic back in June 2017 before I understood how RiSC-16 was using Tomasulo's algorithm in renaming registers. Anyway, thinking that there was no register renaming in RiSC-16 I went ahead and came up with a version. I was then surprised to find out there was no difference in the execution of code between the new and older versions. Then I went back and studied Tomasulo some more. I also studied the RiSCV-BOOM processor code.

The register renaming mechanism I wrote turned out to use a whole lot of resources (30,000 FPGA LUTs) but it seemed to work. It’s available for a read on Github under my account. \Cores\FT64\trunk\rtl\sandbox\FT64a.v
Most of the code pertinent to register renaming can be found by text searching for ‘RENAME’ in the file.

There are several things with this code that I wasn't sure about. One of them is recording the target register tag in the ROB for use when the instruction is retired, to free up the mapping. I had a lot of trouble understanding how the register mappings would be freed. Also questionable is how the maps are reused. The code just assumes the next map is available when it needs to allocate a physical register. This assumption is valid only because there are as many register maps as there are ROB entries. However I note there does not need to be as many maps as ROB entries, as not all instructions allocate target registers. So the code is wasting transistors in that regard.
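
One commonly described way to handle the freeing is to record, in each ROB
entry, the previous physical register the destination was mapped to at rename
time: that old register is freed when the instruction retires, and the newly
allocated one is freed (and the old mapping restored) on a flush. A rough C
sketch, with names (rat[], alloc_preg, free_preg, rob_entry) made up for
illustration rather than taken from FT64 or BOOM:

    extern unsigned alloc_preg(void);       /* hypothetical free-list helpers */
    extern void     free_preg(unsigned p);

    unsigned rat[32];                       /* architectural -> physical map  */

    struct rob_entry {
        unsigned areg;                      /* architectural destination      */
        unsigned new_preg;                  /* mapping created at rename      */
        unsigned old_preg;                  /* mapping it displaced           */
    };

    void rename_dest(struct rob_entry *e, unsigned areg)
    {
        e->areg     = areg;
        e->old_preg = rat[areg];            /* remember what we displace      */
        e->new_preg = alloc_preg();
        rat[areg]   = e->new_preg;
    }

    void retire(struct rob_entry *e)        /* oldest-first, in program order */
    {
        free_preg(e->old_preg);             /* no one can name it any more    */
    }

    void flush(struct rob_entry *e)         /* youngest-first on a mispredict */
    {
        rat[e->areg] = e->old_preg;         /* undo the rename                */
        free_preg(e->new_preg);
    }

With this arrangement the mapping state that has to be tracked is bounded by
the ROB entries that actually write a register, which matches the observation
that not every ROB entry needs a map.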

winden

unread,
Aug 14, 2018, 3:33:52 PM8/14/18
to
[ Great thread, had to re-read 7 times, each time understanding a bit more ]


And a little bit of full-text searching reveals an old comp.arch post by Mitch on the very sub-topic of CAM-based renaming, too:


https://groups.google.com/forum/#!searchin/comp.arch/mitch$20alsup/comp.arch/JQLJf0u2mbk/x0AuGAOkhlwJ

Path: sparky!uunet!dtix!darwin.sura.net!jvnc.net!yale.edu!qt.cs.utexas.edu!cs.utexas.edu!loyola!ross.COM!sparhawk!mitch
From: mi...@sparhawk.com (Mitch Alsup)
Newsgroups: comp.arch
Subject: Register Renaming and Backtrack
Message-ID: <1992Aug6...@ross.COM>
Date: 6 Aug 92 21:42:57 GMT
Sender: ne...@ross.COM
Organization: Mitch Alsup Consulting
Lines: 184
Originator: XXXXX
Nntp-Posting-Host: sparhawk

Register Renaming with Backtrack

By: Mitch Alsup

Date of Invention Oct 1990

Background:

Processors which execute instructions before it is known that the instruction
will actually finish must provide mechanisms to restore the state of the processor
such that the incorrectly executed instructions appear to have NOT executed. There
are several reasons to "Speculatively" execute instructions: for example, Branch
Prediction in SuperScalar machines, and Out-Of-Order and Partially-ordered execution
mechanisms. Both of these mechanisms relax the execution order from the strict
"in order" semantics seen by the "architectural" instruction stream so that the
instructions may be dynamically rearranged for improved performance.

Either branch prediction mistakes, or instruction exceptions can cause the execution
order performed by the processor to differ from the semantics demanded by the
architectural processor model. Processors "back up" to a point at or before
the exception/mistake occurred and continue as if the event occurred in natural
architectural order.

Invention:

This invention provides a high performance mechanism whereby the processor can
quickly be put back into its architectural order when a great number of predictions
and/or exceptions can occur.

This invention controls the (re)naming of registers in a superscalar implementation
of a general purpose RISC machine Instruction Set Architecture.

Invention Architecture:

Consider the following C data structure and Procedures:

1 |# define NUMBER_OF_REGISTERS (32)
2 |# define NUMBER_OF_ISSUES (6)
3 |# define NUMBER_OF_PREDICTIONS (16)
4 |
5 |struct Register_Map { unsigned current[ NUMBER_OF_REGISTERS ],
6 | new[ 2][ NUMBER_OF_ISSUES ],
7 | completed;
8 | } map[ NUMBER_OF_PREDICTIONS ];
9 |
10 |unsigned issue_point,
11 | retire_point;
12 |
13 |struct Instruction { unsigned op_code,
14 | destination,
15 | source1,
16 | source2,
17 | used_d,
18 | used_s1,
19 | used_s2,
20 | mod__d;
21 | } issue[ NUMBER_OF_ISSUES ];
22 |
23 |void map_registers( struct Instruction *i )
24 |{
25 | unsigned n,
26 | new_issue,
27 | completed;
28 |
29 | new_issue = ( issue_point + 1 ) % NUMBER_OF_PREDICTIONS;
30 | completed = map[ issue_point ].completed;
31 | map[ new_issue ] = map[ issue_point ];
32 | map[ new_issue ].completed = FALSE;
33 |
34 | for( n = 0; n < NUMBER_OF_ISSUES; n++, i++ )
35 | if( i->op_code != NO_OPERATION )
36 | {
37 | unsigned destination = i->destination;
38 |
39 | if( i->used_s1 )
40 | i->source1 = map[ issue_point ].current[ i->source1 ];
41 | if( i->used_s2 )
42 | i->source2 = map[ issue_point ].current[ i->source2 ];
43 | if( i->used_d )
44 | i->destination = map[ issue_point ].current[ destination ];
45 |
46 | if( i->mod__d )
47 | {
48 | map[ new_issue ].current[ destination ] = map[ issue_point ].new[ completed ][ n ];
49 | map[ new_issue ].new[ 0 ] [ n ] = map[ issue_point ].new[ completed ][ n ];
50 | map[ new_issue ].new[ 1 ] [ n ] = i->destination;
51 | }
52 | }
53 | else
54 | {
55 | map[ new_issue ].new[ 0 ] [ n ] =
56 | map[ new_issue ].new[ 1 ] [ n ] = map[ issue_point ].new[ completed ][ n ];
57 | }
58 |}

Discussion:

This mechanism maps the incoming logical registers into physical registers and
builds a new entry in the mapping table for the next issue cycle.

Lines 29 through 32 initialize the next map entry to contain the same map as the
current issue point map and initialize the completion variable to "has not completed."
This is conservative because all of the instructions in the issue packet may be
of a kind that do not modify the register mapping information (e.g. stores, branches).

Lines 34 and 35 allow each instruction to interrogate the map and make the updates
they require to register their semantics in the map for the next issue cycle.

Lines 39 through 44 read the physical register number in the current map which corresponds
to the logical register from the instruction.

Lines 46 through 51 update the map for the next issue cycle. This update is based on
the "completed" variable. The new register to be allocated comes from the section of
the map indexed by the completion of the set of instructions which previously used
this map entry (circular queue). Basically, this new field maintains two allocation
fields. The first field is used if the set of instructions was discarded from the
machine, while the second is used if the instructions completed execution. This allows
the machine to back up the register mapping and obtain new registers to allocate simply
by manipulating the issue_point variable (in the branch unit). Likewise, if the
instructions were executed to completion, the "other" allocation entry contains the
registers allocated by the previous set of instructions to use this mapping slot. Since
these instructions were executed to completion, the "old" registers associated with
the logical registers in these instructions can be used for subsequent allocations.
(The new destination allocated on the previous pass now contains the valid register data.)

Lines 53 through 57 update the allocation map for those slots in the issue logic which
did not have an instruction associated. This could correspond to an issue constraint,
a function unit constraint, or some other decoding artifact.

A physical register is always in one of three states: Visible, Invisible, Idle.

A visible register is found in the map at the location indexed by its corresponding
logical register. A Visible register becomes invisible when an instruction writes
to the logical register which corresponds to the destination register. This is
accomplished by placing the allocated register in the slot previously occupied
by the Visible register and placing the Visible register in the allocation entry
for the Next_Issue slot which corresponds to the successful completion of this
set of instructions.

An invisible register is one which has been overwritten in a logical sense (used as a
destination), but may become visible if the machine is required to back up due to a branch
misprediction or exception. Invisible registers become visible if the machine backs
up to a point prior to their logical modification. Invisible registers become Idle
when the set of instructions is known to complete without error.

Therefore: Idle registers are those registers which have no live entry in the allocation
map, and can be selected for allocation. These are grouped and associated to the map
by the new(ly_allocated) fields and indexed by the completion of the last set of
instructions to occupy this map.

By maintaining two lists of information, one corresponding to successful execution
and the other to unsuccessful execution, one always has new registers to allocate to
newly issuing instructions.

The source code above issues the instructions by sequentially manipulating the map;
actual hardware would manipulate the map in parallel. This requires that the decoder
provide information about the intra-instruction forwarding needed for the map
update manipulation. I leave this as an exercise to the reader.

The Invention:

The invention herein described consists of the dual allocation fields indexed by the
completion of the previous set of instructions to occupy this slot. If the instructions
executed properly, the new registers to be allocated are found in the second set;
otherwise they are found in the first set.

The nearest example of this mechanism found in the literature is the register
allocator in the Floating point Unit in the IBM RS/6000 computer.

This invention improves on the RS/6000 mechanism by allowing for speculative
execution and rapid backup. In addition, keeping track of whether the instructions
were completed to select a group of registers which can be allocated to a new issue
process has not been exposed in the literature.

Grant:

I hereby grant to everyone the right to use this invention. This invention comes
with no warranty expressed or implied. This grant is subject to someone else already
laying claim to its feature functionality.
-------------------------------------------------------------------------------------
Mitch Alsup Currently at: mi...@ross.com An Independent Consultant
(512)-263-5086 Evenings
Disclaimer: I speak for myself.
-------------------------------------------------------------------------------------

MitchAlsup

unread,
Aug 14, 2018, 9:55:49 PM8/14/18
to
I want to thank you for bringing this back up; I had lost it for "quite
a while".

This particular renamer uses a history attachment to the register map that enables recovering the map "between cycles".

Thanks again.

EricP

unread,
Aug 17, 2018, 9:32:39 PM8/17/18
to
winden wrote:
>
> The nearest example of this mechanism found in the litterature is the register
> allocator in the Floating point Unit in the IBM RS/6000 computer.
>
> This invention improves on the RS/6000 mechanism by allowing for specualtive
> execution and rapid backup. In addition, keeping track of whether the instructions
> were completed to select a group of registers which can be allocated to a new issue
> process has not been exposed in the litterature.

For anyone interested...
The RS/6000 renamer mechanism is described pretty well here,
including how registers are recycled, but unfortunately it doesn't
detail how a rollback is performed for, say, a mispredicted branch
or an exception.

Machine Organization of the IBM RISC System 6000 Processor 1990
https://pdfs.semanticscholar.org/52f8/6f86237ed109d84ae7ae408c077567c95db0.pdf




EricP

unread,
Aug 17, 2018, 9:32:39 PM8/17/18
to
MitchAlsup wrote:
>
> I want to thank you for bringing this back up, I had lost if for "quite
> a while".
>
> This particular renamer uses a history attachment to the register map that enables recovering the map "between cycles".
>
> Thanks again.

How are the reservation stations selectively canceled and released?

Say we have 16 architectural registers, 64 physical registers,
and an 80-entry circular instruction queue. The IQ is full and instruction
40 is a mispredicted branch, so we need to toss instructions 41:79
while keeping 0:40, and selectively cancel pending reservation
station allocations and in-flight function unit executions for
just instructions 41:79.

It would be nice if in 1 clock it could cancel all pending RS
allocations and abort all in-flight FU executions for just those
instructions, rather than sequencing through the list.

One way I thought of was to have a Cancel bus that is routed
to all RS's and FU's, passing the low and high instruction queue
index numbers to cancel. Each RS and FU already has the IQ index
for its instruction anyway so it can mark it complete.

The Cancel-Abort mechanism has each RS and FU use a pair of
comparators to check if the IQ index of its current instruction
was in the cancel range. If it was, then the RS clears its
in-use flags or the FU aborts.

The IQ is a circular buffer so the compares need to be circular too.
It needs a flag to indicate if the cancel range wraps around the end
of the IQ. So each RS or FU does:
if (!wrapped) then
abort = (myIndex >= lowCancelIndex) & (myIndex <= highCancelIndex)
else
abort = (myIndex >= lowCancelIndex) | (myIndex <= highCancelIndex)
endif

It doesn't seem that much more difficult to drive such a cancel bus with
comparators hanging off it than to drive CAMs on forwarding busses.
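
In C terms the per-RS/FU check is just a circular range test; one way to get
the wrapped flag is simply lowCancelIndex > highCancelIndex (the function and
index names here are only for illustration):

    /* True when myIndex lies in the circular range [lo, hi], inclusive.   */
    /* Canceling 41:79 as in the example above is the unwrapped case; a    */
    /* range such as 70:10 is the wrapped case.                            */
    int in_cancel_range(unsigned myIndex, unsigned lo, unsigned hi)
    {
        int wrapped = lo > hi;              /* range crosses the end of the IQ */
        if (!wrapped)
            return myIndex >= lo && myIndex <= hi;
        else
            return myIndex >= lo || myIndex <= hi;
    }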

Eric

lk...@lkcl.net

unread,
Dec 25, 2018, 9:27:09 PM12/25/18
to
On Saturday, April 28, 2018 at 8:54:17 AM UTC+1, infinis...@gmail.com wrote:

> I was wondering if anybody/any company has tried to implementing
> scoreboarding with register renaming? It seems like a possibilty...

well! it's funny you should ask... :)

by now you probably noticed that i joined the conversations here
on comp.arch well after this question was asked, and i figured
it might be nice to go back through some threads, and hey look!
this question's near-identical to the discussion(s) of the past
month.

this is an awesome thread, read it all, however i figured it might
be nice to go right back to the beginning, and provide a direct
answer (which summarises as "yes!"), and provide a clear synopsis
of the discussions from december.

* the answer is yes... and you do not need a RAT, or CAMs,
or a Reorder Buffer, or a separate PRF-ARF pre-analysis
pre-allocation phase.

* start with a Register Alias Table (RAT Table). note that
it contains register number, tag, and "valid" bit.

* attach multiple-entry "Reservation Station Table" style
rows to each Function Unit of a *standard* scoreboard.
give *EACH ROW* a unique tag name (which is stored in the
RAT, above).

so far, we have a bog-standard RAT system that is commonly
used to give register-renaming to scoreboards... here's where
it gets different:

* note that the RAT table "tag" is effectively a reverse
lookup variant of a Tomasulo Reorder Buffer row number.

* where a Tomasulo Reservation Station stores a reference
to a ROB row number, in the *standard* RAT-scoreboard
style Reservation Station, the RAT tag points to the
RS *row*, i.e. in the reverse direction to a ROB#.

(note that the RAT tag points not to the RS *table*, it points
to the row *in* the table).

* now take the RAT-style Reservation Station tables and
FLATTEN them, *and* add one Function Unit *per RS entry*.

FU-add1 FU-mul1
RSadd1.1 RSmul1.1
RSadd1.2 RSmul1.2

becomes:

FU-add1 FU-add2 FU-mul1 FU-mul2
RSadd1.1 RSadd1.2 RSmul1.1 RSmul1.2

effectively we may view this as: the FUs *become* in effect
the RS's. or, another way to view it: FUs gain src register
latches (if they don't have them already)

* now, recall that the Reservation-Station-aka-Function-Unit
has, just as in any standard scoreboard, an FU-to-Register
Matrix and an FU-to-FU Matrix.

* recall that in the RAT system, the tag indicates where,
on a new instruction issue, the Function Unit allocated
to handle that instruction should look for its src
register.

* remember that this relationship is *exactly* an FU-to-Register
relationship...

* ... and that it is a unique one. it is NOT a CAM, in
other words.

* therefore, where we previously had a RAT "tag", we may
REPLACE that with a mutually-exclusive bit-array (bit
vector), of length EQUAL to the number of Function Units,
where ONLY ONE bit of that array is permitted to be ON,
and all other bits are required to be OFF.

(all bits "off" is also permitted, therefore we do not
need the "valid" bit of a standard RAT).

* note that this bit-array of length equal to FUs is in
a table of length equal to the number of registers.

* therefore we may drop this on top of the FU-to-Reg
Dependency Matrix! i.e. it becomes a simple (single-latch)
addition to each FU-Reg Cell

that's basically it. that's all there is to it. i deliberately
went through it like i did, above, because it shows how to
directly morph a pre-existing well-known *proven* technique
(Register Alias Tables as applied to scoreboards) into one which
completely does away with CAMs, RATs, PRFs, ARFs and ROBs.

to repeat the key characteristics again:

* multiply up (double or triple) the number of Function Units

* add RS-style src register "latches" to each FU [they should
already exist anyway]

* to the FU-to-Reg Dependency Matrix (FU being rows, Regs being
columns) add one extra flag-bit per cell, that, along each
row ONLY ONE of those bits is permitted to be set in any given
clock cycle.

in this way each FU has a ROB-style destination number record
(just "pointing" in the *opposite* direction to a ROB... this
is the primary reason why only a single bit is required,
as opposed to a CAM).

that really is all there is to it. whenever a new instruction
is issued, the mutually-exclusive (new) bit vector is checked
in that register's column.

if there is no bit set, then it indicates (just like with the
RAT table "valid" bit) that the value in the Register File may
be read as a source register.

so, the FU may happily read whatever source register(s) it needs,
as long as they are valid, and stores them in the FU's src
reg "latches" [that we earlier said were normally part of an
RS's row, which are now *part* of the FU if they aren't already].

if any of the src registers *do* have a bit set, it tells us a
number of things:

(1) (with a single bit) that there exists a Function Unit which
has NOT YET generated the result that we want in this
particular register (or, more specifically the dependent FU
has not stored the result back in the Reg File yet)

(2) with a *SINGLE BIT*, it tells us *EXACTLY* which Function
Unit we need to make this current instruction dependent
on, for that src register

therefore, instead of storing the value in the latch (because
it's not ready yet), we just... um... do exactly what is normally
done in a scoreboard: set up an absolutely standard FU-to-FU
RaW dependency link.

lastly, as part of the dependency-setup, the mutually-exclusive
FU bit is set in (row=FU col=DestReg) to indicate to future
instructions that they should look to *this* FU for this DestReg
if they in turn need it as a src reg.

if there happened to be a bit set previously, then its old state
is read *by* the instruction, and the new (changed) state is
updated for use on the *next* cycle *NOT* the current one.

it's real simple.

no CAMs. no more RAT. RS is merged into FU. the PRF *is* the
ARF, just as in the Tomasulo Algorithm... except there's no
ROB, either.
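
as a very rough, purely behavioural C sketch of the issue-time check described
above (NUM_FUS, NUM_REGS, fu_writes_reg[][] and the rest are names made up just
for illustration, not anyone's actual RTL; hardware reads the whole column in
parallel rather than looping):

    #define NUM_FUS  8
    #define NUM_REGS 32

    /* FU-to-Reg matrix: fu_writes_reg[f][r] set means FU f will produce the */
    /* newest value of register r and has not yet written it back.           */
    /* invariant: at most one bit set in any register column.                */
    int      fu_writes_reg[NUM_FUS][NUM_REGS];
    unsigned regfile[NUM_REGS];

    struct fu_entry {
        unsigned src_latch[2];          /* RS-style operand latches          */
        int      src_ready[2];
        int      waits_on[2];           /* FU-to-FU RaW dependency, or -1    */
    } fus[NUM_FUS];

    /* issue one instruction (dest <- src[0] op src[1]) to function unit f */
    void issue(int f, unsigned dest, const unsigned src[2])
    {
        for (int s = 0; s < 2; s++) {
            int producer = -1;
            for (int g = 0; g < NUM_FUS; g++)     /* read the src column     */
                if (fu_writes_reg[g][src[s]])
                    producer = g;                 /* at most one hit         */
            if (producer < 0) {                   /* value is in the regfile */
                fus[f].src_latch[s] = regfile[src[s]];
                fus[f].src_ready[s] = 1;
                fus[f].waits_on[s]  = -1;
            } else {                              /* RaW: wait on that FU    */
                fus[f].src_ready[s] = 0;
                fus[f].waits_on[s]  = producer;
            }
        }
        for (int g = 0; g < NUM_FUS; g++)         /* clear any previous      */
            fu_writes_reg[g][dest] = 0;           /* writer of dest...       */
        fu_writes_reg[f][dest] = 1;               /* ...this FU now owns it  */
        /* (the WaW ordering against the previous writer, and the rule that  */
        /* the column update only becomes visible on the next cycle, are     */
        /* glossed over here -- the standard FU-to-FU matrix covers those)   */
    }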


the only difficulty encountered with this scheme is: how to make
it multi-issue. for a "standard" architecture i have not thought
that through.

instead, what we decided to do is to stratify the entire register
file into banks, and for the number of banks to be *equal* to the
issue limit.

thus we allocate registers 0 4 8 12.... to bank 0, registers 1 5 9 13
to bank 1 etc.

and, then, the issue is one *per destination* register. thus, in
a 4-issue scheme, if we have the following instructions:

ADD r1, r1, 5 # bank1
ADD r2, r2, 5 # bank2
ADD r7, r7, 5 # bank3
ADD r4, r4, 5 # bank0

these are all on SEPARATE destination banks, and thus may be issued
all in the same cycle. however THIS may NOT:

ADD r1, r1, 5 # bank1
ADD r5, r2, 5 # bank1
ADD r9, r7, 5 # bank1
ADD r13, r4, 5 # bank1

these are ALL on bank 1, and thus are required to be issued
in FOUR separate cycles.
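
a tiny C sketch of the per-group bank check (4 banks assumed, bank = reg mod 4,
names made up for illustration):

    /* true if the (up to 4) destination registers land in distinct banks, */
    /* i.e. the whole group may be issued in the same cycle                */
    int can_issue_together(const unsigned dest[], int n)
    {
        unsigned banks_used = 0;
        for (int i = 0; i < n; i++) {
            unsigned bank = dest[i] & 3;          /* bank = reg mod 4 */
            if (banks_used & (1u << bank))
                return 0;                         /* two dests hit the same bank */
            banks_used |= 1u << bank;
        }
        return 1;
    }

in the first example the destination banks are 1, 2, 3, 0 so all four issue
together; in the second they are all bank 1, so the issuer would take the
longest conflict-free prefix (here just one instruction) per cycle.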

basically it comes down to the fact that if you go back to the
original RAT+scoreboard, only one tag may be checked and set
in one cycle, for each register.

so if you have these four instructions (and a 4-issue machine)

ADD r1, r2, 5
ADD r2, r1, 5
ADD r1, r2, 5
ADD r2, r1, 5

you have a bit of a problem, because you need to allocate
the first instruction to the first available FU, and set
dest r1 as its RAT, set R1 as the *source* dependency,
then grab a *third* FU with the *second* as its source
dependency (on R2 this time), *destroying* the tag entry
created for the *first* instruction... oh and then doing
the same thing for the fourth instruction, destroying
(overwriting) the tag for R2.

absolute awful mess!

if someone knows of a solution to *that* problem (how to make
*that* multi-issue) then *EXACTLY* the same solution may be
applied to the above scheme.

i simply cheated by entirely avoiding the problem, by stratifying
into four banks, such that, as long as there are no bank-overlaps,
four destination registers (and thus four of the bit-vectors
equivalent to RAT table entries) may be set, easily, with very
little logic, in one cycle.

l.

already...@yahoo.com

unread,
Dec 26, 2018, 7:51:38 AM12/26/18
to
The CPU described in the paper is what later became known as the POWER1?