virtual-architectural register renaming in front of 6600 style scoreboards

luke.l...@gmail.com

Oct 28, 2020, 12:14:38 PM
hi folks, been a while.

got a bit of a conundrum. i initially believed that 6600 scoreboards contained all the (implicit, FU latch based) renaming needed to not require any form of separation between architectural (ISA) register files and the register numbers used in the FU-REGs Dependency Matrix (known in Tomasulo as the ROB Row number).

however a simple loop around the following three instructions shows this not to be the case:

LD r9, ....
ADD r9, r9, 100
ST r9, ....

the deficiency (lack of renaming) shows up on the 2nd loop iteration, where the in-flight ST collides with the ADD. the WaW hazard between the LD and ADD causes unnecessary RaW stalls (hold-up), whereas renaming breaks the dependency chain and allows full independent overlap between iterations.
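a minimal toy model of what renaming buys here (all names and structures invented for illustration, not our actual design): each write gets a fresh physical destination, so the 2nd iteration's LD cannot collide with the 1st iteration's still-in-flight ST.

```python
free_regs = list(range(4))          # tiny physical register pool
rename = {}                         # architectural -> latest physical

def write_dest(arch_reg):
    phys = free_regs.pop(0)         # fresh name: breaks WaW and WaR
    rename[arch_reg] = phys
    return phys

# iteration 1: LD r9 / ADD r9,r9,100 / ST r9
ld1  = write_dest("r9")             # LD's destination
src  = rename["r9"]                 # ADD reads the LD's version
add1 = write_dest("r9")             # ADD's destination (ST reads this)
# iteration 2's LD can issue immediately: it has a different physical
# destination, so it cannot collide with iteration 1's in-flight ST
ld2 = write_dest("r9")
assert src == ld1 and len({ld1, add1, ld2}) == 3
```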

initially i believed that the 6600 Dependency Matrices would "solve" this however it is a WaW scenario, and i came up with an extremely complex and costly scheme to solve it, last year, which i would prefer not to use unless there is no other choice.

i do not have a problem with adding a virtual register (renaming) cache, however i would greatly prefer that it not involve CAMs, after learning from Mitch last year of the power benefits of using unary encodings, and his highlighting how easy it is to drop multi-issue on top (raise multiple unary bits) - and we need multi-issue.

The problem we have for LibreSOC is this: the number of registers, because it is a hybrid GPU, is a whopping 128 INTs and 128 FP regs *and that is architectural* i.e. real ISA addressable regs, not virtual ones.

256 ISA regs is getting to the point where a unary-driven multi-issue lookup table, going from 256 ARF entries to 64 VRF entries, is alarmingly large *and still hasn't got renaming in it*!

how can this be solved?

what schemes exist that can cope with 256 registers at the ISA level, can do multi-issue, provide renaming (detect and mitigate the RaW after WaW bottleneck) and preferably do not involve CAMs?

OR

have we just completely missed something that really does mean that this is a non-issue, and standard precise-capable scoreboards are perfectly sufficient?

as always grateful for the discussion and insights

l.


EricP

Oct 28, 2020, 1:11:22 PM
If you've got 128 regs to play with, wouldn't it be easier to teach
the compiler register allocator to round-robin its allocations?

One inexpensive renaming method is to just extend the
Architecture Register Number with some version number bits
to give the Physical Register Number.
If the ARN is 4 bits and it is extended with a 2 bit version
{ARN-4,VER-2} then there is a direct map from logical to
physical so no mapping table is required.
With this you could have 4 versions of each register
in flight without a stall occurring.
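A minimal sketch of that direct map (widths shrunk for illustration; the helper names are hypothetical): the physical register number is just the architectural number with a version counter concatenated on, so no mapping table exists at all.

```python
# {ARN, VER} concatenation: the "map" is pure wiring (shift and OR)
VER_BITS = 2                        # 2 bits -> up to 4 in-flight versions

version = {}                        # per-arch-reg write counter

def physical(arn):
    ver = version.get(arn, 0)
    return (arn << VER_BITS) | ver  # physical number, directly formed

def new_version(arn):
    # each new write bumps the version (wrapping modulo 4)
    version[arn] = (version.get(arn, 0) + 1) % (1 << VER_BITS)
    return physical(arn)

p0 = physical(9)                    # current r9
p1 = new_version(9)                 # r9 after a write: distinct phys reg
assert p0 != p1
assert (p0 >> VER_BITS) == (p1 >> VER_BITS) == 9   # same arch reg
```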

It is simple but quite inefficient and wasteful
as 99% of the physical registers sit unused most of the time.

The fly in that approach is your 128 architecture registers,
times 2 for the floats, times the number of versions.
Having two banks of 512 physical registers, mostly unused,
is... unappealing.

Also when you introduce renaming, even a light version,
you also must consider conditional branch mispredict rollback.
It would have to snapshot the version number of each
arch register for each conditional branch instruction,
and restore those version numbers on mispredict.
And you need one such snapshot for each conditional branch
that can be in-flight.
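A toy model of that snapshot mechanism (structures invented for illustration): each conditional branch saves the current version counters, and a mispredict restores them, which is cheap because a snapshot is only a few bits per register.

```python
versions = {9: 1, 10: 3}            # current version per arch reg
snapshots = []                      # one entry per in-flight cond. branch

def on_branch():
    snapshots.append(dict(versions))    # save the whole version map

def on_mispredict():
    # discard speculative version bumps, restore the pre-branch map
    versions.clear()
    versions.update(snapshots.pop())

on_branch()
versions[9] = 2                     # speculative write bumps r9's version
on_mispredict()
assert versions[9] == 1             # rollback restored the old mapping
```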


MitchAlsup

Oct 28, 2020, 2:23:21 PM
On Wednesday, October 28, 2020 at 11:14:38 AM UTC-5, luke.l...@gmail.com wrote:
> hi folks, been a while.
>
> got a bit of a conundrum. i initially believed that 6600 scoreboards contained all the (implicit, FU latch based) renaming needed to not require any form of separation between architectural (ISA) register files and the register numbers used in the FU-REGs Dependency Matrix (known in Tomasulo as the ROB Row number).
>
> however a simple loop around the following three instructions shows this not to be the case:
>
> LD r9, ....
> ADD r9, r9, 100
> ST r9, ....
>
> the deficiency (lack of renaming) hits on the 2nd loop where the ST is in-flight and hits the ADD. the WaW between the LD and ADD causes unnecessary RaW stalls (hold-up) where renaming breaks the dependency chain and allows full independent overlap between loops.

One way to break the dependence is to recognize that the R9 from the LD is
dead after being consumed as an operand to the ADD, and there is a new R9
from the ADD which is then consumed by the ST (but is not known to be dead
until the LD becomes "architectural").

Thus there are 2 R9s--the one from the LD and the one from the ADD. Given
that the R9 from the LD is dead at the next instruction, there is only one
R9 that needs to be "tracked" (the second one). In order to make use of this
you have to guarantee that the LD can forward its result to the ADD as an
operand without having gone through the register file with a write which is
then read from the ADD. If you can also guarantee that the ADD can forward
to the ST, you can defer the write of the ADD result until you know that
the next-loops LD will happen. And at this point, there are no register
write dependencies on a per loop basis, only one at the end of the loop.

So, you defer writes to R9 from LD and ADD, and you subsequently cancel
those deferred writes when you know no exceptions have transpired and
the loop is to have at least another iteration. This will require the
calculation units to have a few results waiting for writes into the RF
and a new cancel write mechanism.
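A toy model of the defer-and-cancel idea (structures invented for illustration; note it collapses each register to its newest value, whereas real HW must keep a superseded write recoverable until the newer one is known exception-free): results forward to consumers immediately, the RF write is parked, and a newer write to the same register cancels the older parked one.

```python
pending = {}                        # reg -> value awaiting an RF write
regfile = {"r9": 0}

def produce(reg, value):
    pending[reg] = value            # newer result supersedes the older
    return value                    # ...which forwards right away

def retire():
    # loop exit (or exception): surviving writes become architectural
    regfile.update(pending)
    pending.clear()

v = produce("r9", 5)                # LD result, forwarded to the ADD
v = produce("r9", v + 100)          # ADD result, forwarded to the ST
retire()                            # only the last value reaches the RF
assert regfile["r9"] == 105
```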
>
> initially i believed that the 6600 Dependency Matrices would "solve" this however it is a WaW scenario, and i came up with an extremely complex and costly scheme to solve it, last year, which i would prefer not to use unless there is no other choice.
>
> i do not have a problem with adding a virtual register (renaming) cache however i would greatly prefer that it not involve CAMs, after learning from Mitch last year of the power benefits of using unary encodings, as well as highlighting how easy it is to drop multi-issue on top (raise multiple unary bits) and we need multi-issue.

If you are issuing the "whole" loop in a single cycle, you are in a position
to know about the LD result being dead if another loop iteration is required,
and to know that the ADD result is also dead if another iteration is required.
So, you can defer the write based on whether an iteration is actually required
and cancel the writes if it is required (branch shadow cancellation.)
>
> The problem we have for LibreSOC is this: the number of registers, because it is a hybrid GPU, is a whopping 128 INTs and 128 FP regs *and that is architectural* i.e. real ISA addressable regs, not virtual ones.

Above, I added no registers, just more dependence tracking.

luke.l...@gmail.com

Oct 28, 2020, 4:02:59 PM
On Wednesday, October 28, 2020 at 5:11:22 PM UTC, EricP wrote:

eric thanks for responding.

> If you've got 128 regs to play with, wouldn't it be easier to teach
> the compiler register allocator to round-robin its allocations?

turns out that this is a pain to do (and frowned on, to make a compiler
tied to a microarchitecture's limitations)

> One inexpensive renaming method is to just extend the
> Architecture Register Number with some version number bits
> to give the Physical Register Number.
> If the ARN is 4 bits and it is extended with a 2 bit version
> {ARN-4,VER-2} then there is a direct map from logical to
> physical so no mapping table is required.
> With this you could have 4 versions of each register
> in flight without a stall occurring.

my concern with this approach is: how do you ensure rollover of the version number does not occur? i.e. how to detect that if, say, a 5th rename occurs, you do not corrupt or try to use the 1st, which may still not be committed?

tracking this is where i started going "um" and "err" :)

> It is simple but quite inefficient and wasteful
> as 99% of the physical registers sit unused most of the time.

in a GPU (one of the dual tasks this is for) those 128 regs will often be holding 4x4 FP matrices, so utilisation will be pretty high, because the power penalty for spilling is so drastic in a GPU workload.

> The fly in that approach is your 128 architecture registers,
> times 2 for the floats, times the number of versions.
> Having two banks of 512 physical registers, mostly unused,
> is... unappealing.

indeed :)

> Also when you introduce renaming, even a light version,
> you also must consider conditional branch mispredict rollback.

ok. so we're using Shadowing. so the hold-and-cancellation from the Shadow Matrices when "godie" is called would need to propagate back through to the register cache to say that the virtual reg allocation is now free for reuse, on all virtual registers allocated to all shadow-cancelled instructions, and no "damage" occurs (no rollback needed either. it means a lot of inflight RSes but that is tolerable)

does that sound workable?

l.

luke.l...@gmail.com

Oct 28, 2020, 4:52:04 PM
On Wednesday, October 28, 2020 at 6:23:21 PM UTC, MitchAlsup wrote:
> On Wednesday, October 28, 2020 at 11:14:38 AM UTC-5, luke.l...@gmail.com wrote:

> > however a simple loop around the following three instructions shows this not to be the case:
> >
> > LD r9, ....
> > ADD r9, r9, 100
> > ST r9, ....

> One way to break the dependence is to recognize that the R9 from the LD is
> dead after being consumed as an operand to the ADD, and there is a new R9
> from the ADD which is then consumed by the ST (but is not known to be dead
> until the LD becomes "architectural".
>
> Thus there are 2 R9s--the one from the LD and the one from the ADD. Given
> that the R9 from the LD is dead at the next instruction, there is only one
> R9 that needs to be "tracked" (the second one). In order to make use of this
> you have to guarantee that the LD can forward its result to the ADD as an
> operand without having gone through the register file with a write which is
> then read from the ADD. If you can also guarantee that the ADD can forward
> to the ST, you can defer the write of the ADD result until you know that
> the next-loops LD will happen.

to be clear: this is just operand forwarding, prioritising that over regfile write-reads?

> And at this point, there are no register
> write dependencies on a per loop basis, only one at the end of the loop.

interesting. wait... the *whole* loop? moo? :) errr so the entire computation remains exclusively in in-flight RSes data (as far as r9 is concerned) until the exit from the loop?

ohhh is this how the vectorisation in 66000 works? if so then after nearly 2 years i might have got it.

> So, you defer writes to R9 from LD and ADD, and you subsequently cancel
> those deferred writes when you know no exceptions have transpired

to clarify, by "cancel" do you mean in Shadowing context, "drop the shadow hold preventing write" rather than "cancel the actual LD"? or something more subtle?

errr oh hang on no you don't, you really do mean actually cancel the writes because they're detected as WaW and now they're done they can be chucked.
> and
> the loop is to have at least another iteration. This will require the
> calculation units to have a few results waiting for writes into the RF

i don't mind that

> and a new cancel write mechanism.

ah. ok so this detection was the complex stuff i thought through 18 months ago, it involved creating a mini "stack" in each FU-REGs DM Cell, where whenever more than one write hazard was noted (on a "stack") this was the detection mechanism by which the previous (below top of "stack") Write hazards could be determined to have been overwritten.

and, thus, when they have done their job of being (exclusively) operand forwarded, the fact that they're not "top dog" combined with noting - and this is incredibly important - that there are *no* outstanding shadows held on any instructions in front of it - the write can be safely dropped.

my concern about that mechanism is that it doubles the FU-REGs DM Cell size if you want a 2 entry stack, triples for a 3 entry, because it is effectively a duplication of the normal FU-REGs DM Cell in order to record the history (in case of partial shadow cancellation, amongst other things)

however that was my implementation.

is there any other method by which "cancel write" could be implemented which does not involve doubling or tripling the gates needed in FU-REGs?

> If you are issuing the "whole" loop in a single cycle, you are in a position
> to know about the LD result being dead if another loop iteration is required,
> and to know that the ADD result is also dead if another iteration is required.
> So, you can defer the write based on whether an iteration is actually required
> and cancel the writes if it is required (branch shadow cancellation.)

i believe this is a different way of saying that there is a stack in FUREGs Cells.

> > The problem we have for LibreSOC is this: the number of registers, because it is a hybrid GPU, is a whopping 128 INTs and 128 FP regs *and that is architectural* i.e. real ISA addressable regs, not virtual ones.
> Above, I added no registers, just more dependence tracking.

indeed. i think we are talking about the same thing, and i judged my implementation to be quite expensive.

what i wondered was: are there alternatives that cost less gates?

one of the things i did note was that the virtual regfile renaming approach might not be able to detect the "drop write" condition that you note is possible in each loop when using dependency tracking. or if it is, i haven't thought it through yet.

l.

luke.l...@gmail.com

Oct 28, 2020, 4:58:51 PM
On Wednesday, October 28, 2020 at 8:52:04 PM UTC, luke.l...@gmail.com wrote:

> and, thus, when they have done their job of being (exclusively) operand forwarded, the fact that they're not "top dog" combined with noting - and this is incredibly important - that there are *no* outstanding shadows held on any instructions in front of it - the write can be safely dropped.

(because if you don't take note of outstanding shadows issued ahead of the instruction below top-of-stack, those could be shadow-cancelled, making THIS write now "top dog" but you already threw it away and it all goes to s***)

l.

MitchAlsup

Oct 28, 2020, 5:09:31 PM
On Wednesday, October 28, 2020 at 3:02:59 PM UTC-5, luke.l...@gmail.com wrote:
> On Wednesday, October 28, 2020 at 5:11:22 PM UTC, EricP wrote:
>
> eric thanks for responding.
>
> > If you've got 128 regs to play with, wouldn't it be easier to teach
> > the compiler register allocator to round-robin its allocations?
>
> turns out that this is a pain to do (and frowned on, to make a compiler
> tied to a microarchitecture's limitations)
>
> > One inexpensive renaming method is to just extend the
> > Architecture Register Number with some version number bits
> > to give the Physical Register Number.
> > If the ARN is 4 bits and it is extended with a 2 bit version
> > {ARN-4,VER-2} then there is a direct map from logical to
> > physical so no mapping table is required.
> > With this you could have 4 versions of each register
> > in flight without a stall occurring.
>
> my concern with this approach is, how do you ensure rollover of the version number does not occur? i.e how to detect that if say a 5th rename occurs you do not corrupt or try to use the 1st which may still not be committed?

You cannot commit a new register until its 4th prior use has completed.
1111 writes pending and you are issue blocked,
otherwise you are not--one 4-input AND gate.
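A minimal sketch of that rule (names invented for illustration): with a 2-bit version there are 4 slots per arch reg; keep a 4-bit "write pending" mask, and block issue only when all four bits are set, i.e. the 4-input AND gate.

```python
pending_mask = {9: 0b0111}          # three r9 versions still in flight

def can_issue_write(arn):
    # stall only if all 4 version slots have writes outstanding
    return pending_mask.get(arn, 0) != 0b1111

assert can_issue_write(9)           # one slot free: issue proceeds
pending_mask[9] = 0b1111
assert not can_issue_write(9)       # all four pending: stall
```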

MitchAlsup

Oct 28, 2020, 5:18:27 PM
On Wednesday, October 28, 2020 at 3:52:04 PM UTC-5, luke.l...@gmail.com wrote:
> On Wednesday, October 28, 2020 at 6:23:21 PM UTC, MitchAlsup wrote:
> > On Wednesday, October 28, 2020 at 11:14:38 AM UTC-5, luke.l...@gmail.com wrote:
>
> > > however a simple loop around the following three instructions shows this not to be the case:
> > >
> > > LD r9, ....
> > > ADD r9, r9, 100
> > > ST r9, ....
>
> > One way to break the dependence is to recognize that the R9 from the LD is
> > dead after being consumed as an operand to the ADD, and there is a new R9
> > from the ADD which is then consumed by the ST (but is not known to be dead
> > until the LD becomes "architectural".
> >
> > Thus there are 2 R9s--the one from the LD and the one from the ADD. Given
> > that the R9 from the LD is dead at the next instruction, there is only one
> > R9 that needs to be "tracked" (the second one). In order to make use of this
> > you have to guarantee that the LD can forward its result to the ADD as an
> > operand without having gone through the register file with a write which is
> > then read from the ADD. If you can also guarantee that the ADD can forward
> > to the ST, you can defer the write of the ADD result until you know that
> > the next-loops LD will happen.
>
> to be clear: this is just operand forwarding, prioritising that over regfile write-reads?

Not "prioritized over", but forwarding can take place independently of, and
before, result writes.
>
> > And at this point, there are no register
> > write dependencies on a per loop basis, only one at the end of the loop.
>
> interesting. wait... the *whole* loop? moo? :) errr so the entire computation remains exclusively in in-flight RSes data (as far as r9 is concerned) until the exit from the loop?
>
> ohhh is this how the vectorisation in 66000 works? if so then after nearly 2 years i might have got it.

That is basically how it works on scoreboarded implementations, reservation
station machines have a bit more freedom, which may or may not be used while
doing VVM looping.
>
> > So, you defer writes to R9 from LD and ADD, and you subsequently cancel
> > those deferred writes when you know no exceptions have transpired
>
> to clarify, by "cancel" do you mean in Shadowing context, "drop the shadow hold preventing write" rather than "cancel the actual LD"? or something more subtle?

By cancel I mean the writes never happen, even though the instruction was
completed all the way to retire. The write did not HAVE to happen because
another result to the same register overwrote it and is within the horizon
of the execution window. Now, you have to be careful so that if an exception
transpires, the writes are performed (not cancelled).
>
> errr oh hang on no you don't, you really do mean actually cancel the writes because they're detected as WaW and now they're done they can be chucked.

Yes!

> > and
> > the loop is to have at least another iteration. This will require the
> > calculation units to have a few results waiting for writes into the RF
>
> i don't mind that
>
> > and a new cancel write mechanism.
>
> ah. ok so this detection was the complex stuff i thought through 18 months ago, it involved creating a mini "stack" in each FU-REGs DM Cell, where whenever more than one write hazard was noted (on a "stack") this was the detection mechanism by which the previous (below top of "stack") Write hazards could be determined to have been overwritten.

Right, just generalizing it.
>
> and, thus, when they have done their job of being (exclusively) operand forwarded, the fact that they're not "top dog" combined with noting - and this is incredibly important - that there are *no* outstanding shadows held on any instructions in front of it - the write can be safely dropped.

This is why you forward almost immediately, and defer write until you don't
actually have to perform the write.
>
> my concern about that mechanism is that it doubles the FU-REGs DM Cell size if you want a 2 entry stack, triples for a 3 entey because it is effectively a duplication of the normal FU-REGs DM Cell in order to record the history (in case of partial shadow cancellation, amongst other things)

Can you say this again and use different words?
>
> however that was my implementation.
>
> is there any other method by which "cancel write" could be implemented which does not involve doubling or tripling the gates needed in FU-REGs?

WaW DM instead of [RW]a[WR] DM.
>
> > If you are issuing the "whole" loop in a single cycle, you are in a position
> > to know about the LD result being dead if another loop iteration is required,
> > and to know that the ADD result is also dead if another iteration is required.
> > So, you can defer the write based on whether an iteration is actually required
> > and cancel the writes if it is required (branch shadow cancellation.)
>
> i believe this is a different way of saying that there is a stack in FUREGs Cells.
>
> > > The problem we have for LibreSOC is this: the number of registers, because it is a hybrid GPU, is a whopping 128 INTs and 128 FP regs *and that is architectural* i.e. real ISA addressable regs, not virtual ones.
> > Above, I added no registers, just more dependence tracking.
>
> indeed. i think we are talking about the same thing, and i judged my implementation to be quite expensive.
>
> what i wondered was: are there alternatives that cost less gates?
>
> one of the things i did note was that the virtual regfile renaming approach might not be able to detect the "drop write" condition that you note is possible in each loop when using dependency tracking. or if it is, i haven't thought it through yet.
>
> l.

Keep thinking, you are narrowing down the patterns space, and one afternoon
it will collapse into the solution.

luke.l...@gmail.com

Oct 28, 2020, 6:16:24 PM
On Wednesday, October 28, 2020 at 9:18:27 PM UTC, MitchAlsup wrote:
> On Wednesday, October 28, 2020 at 3:52:04 PM UTC-5, luke.l...@gmail.com wrote:

> > interesting. wait... the *whole* loop? moo? :) errr so the entire computation remains exclusively in in-flight RSes data (as far as r9 is concerned) until the exit from the loop?
> >
> > ohhh is this how the vectorisation in 66000 works? if so then after nearly 2 years i might have got it.
> That is basically how it works on scoreboarded implementations, reservation
> station machines have a bit more freedom, which may or may not be used while
> doing VVM looping.

'bout frickin time i got the message about VVM :)


> > my concern about that mechanism is that it doubles the FU-REGs DM Cell size if you want a 2 entry stack, triples for a 3 entry because it is effectively a duplication of the normal FU-REGs DM Cell in order to record the history (in case of partial shadow cancellation, amongst other things)
> Can you say this again and use different words?

ok.

* take the FU-REGs DM Cell as the unit, containing as it does the record of the write hazard requirements, for a given reg#, for a given FU#.
* add MULTIPLE such write-hazard latches, where, if the first is active, the current instruction sets the SECOND, and if the second is active then the third is set.
* this we call the "stack" of hazards (and is why the size of the FUREGs matrix is doubled or tripled)
* have some instructions which result in 3 consecutive writes to the same register. this sets up double WaW meaning that the 2 oldest instructions could have their results discarded once OpFwding is completed

BUT

* have an extra instruction IN BETWEEN instruction2 and instruction3 (let us call it 2.5) which casts a shadow over instruction3

now complete instruction1 and 2 but leave 2.5 still with a Shadow over 3, but IGNORE this situation and assume that it is perfectly safe to discard 2 as a WaW once the OpFwding is completed.

and now the killer which causes data corruption: call "GODIE" on instruction 2.5

the cancellation of 2.5 results in instruction 3 *also* being cancelled, which, ordinarily would have dissolved instruction2 as a WaW. but, because the shadowing of 2.5 over 3 was IGNORED, the regfile write (2) was discarded because *at the time* it looked like a safe thing to do, and it wasn't.

hence why i said that it is critically important to only do WaW discards if there are absolutely no shadows over any instructions further ahead than the WaW regfile write considering to be dropped.

this applies transitively all the way back to instr1.

better words?
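a toy model of the discard rule above (all structures invented, purely illustrative): a superseded (WaW) write may only be dropped when *no* younger in-flight instruction still carries a shadow, because a shadow-cancel could make the "dead" write the newest surviving one again.

```python
insns = [
    {"id": 1,   "writes": "r9", "shadowed": False},
    {"id": 2,   "writes": "r9", "shadowed": False},
    {"id": 2.5, "writes": None, "shadowed": True},   # branch unresolved
    {"id": 3,   "writes": "r9", "shadowed": False},
]

def can_discard(write_id, reg):
    younger = [i for i in insns if i["id"] > write_id]
    # superseded: a younger write to the same reg exists (the WaW)
    superseded = any(i["writes"] == reg for i in younger)
    # safe: no younger instruction is still under a shadow
    shadows_clear = not any(i["shadowed"] for i in younger)
    return superseded and shadows_clear

assert not can_discard(2, "r9")     # 2.5 still shadows 3: keep the write
insns[2]["shadowed"] = False        # 2.5's shadow resolves OK
assert can_discard(2, "r9")         # now instr 3 truly supersedes instr 2
```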


> > is there any other method by which "cancel write" could be implemented which does not involve doubling or tripling the gates needed in FU-REGs?
> WaW DM instead of [RW]a[WR] DM.

aa... oh. new concept to fit into brain. which i have a sneaking suspicion might be the exact same thing as the "stack" idea above, just without needing to be a stack (just one WaW latch per FUREGs DM cell rather than a stack of them, because of the transitive relationships of WaW hazards).

couple of questions if i may:

(a) does the WaW DM interact with the FU-FU matrix at all? in RaW/WaR DMs there is read-pending and write-pending connectivity between FU-REGs and FU-FU. does WaW have to "get in" to FU-FU?

(b) what happens if there is (as with instr 1 2 3 above) a WaW followed by another WaW still inflight? do these two have to be distinguished/discerned (given separate latches) or is it safe when 3 comes in to go "ok you're covered as a WaW already"

> Keep thinking, you are narrowing down the patterns space, and one afternoon
> it will collapse into the solution.

i thought about a Zen quote here then realised a Schroedinger one would fit better :)

l.

Jacob Lifshay

Oct 28, 2020, 6:35:44 PM
On Wednesday, October 28, 2020 at 3:16:24 PM UTC-7, luke.l...@gmail.com wrote:
> On Wednesday, October 28, 2020 at 9:18:27 PM UTC, MitchAlsup wrote:
> > On Wednesday, October 28, 2020 at 3:52:04 PM UTC-5, luke.l...@gmail.com wrote:
>
> > > interesting. wait... the *whole* loop? moo? :) errr so the entire computation remains exclusively in in-flight RSes data (as far as r9 is concerned) until the exit from the loop?
> > >
> > > ohhh is this how the vectorisation in 66000 works? if so then after nearly 2 years i might have got it.
> > That is basically how it works on scoreboarded implementations, reservation
> > station machines have a bit more freedom, which may or may not be used while
> > doing VVM looping.
> 'bout frickin time i got the message about VVM :)

I haven't heard the term/acronym VVM before and wasn't able to find anything relevant by googling it, what is it?

Jacob

luke.l...@gmail.com

Oct 28, 2020, 6:40:00 PM
On Wednesday, October 28, 2020 at 9:09:31 PM UTC, MitchAlsup wrote:
> On Wednesday, October 28, 2020 at 3:02:59 PM UTC-5, luke.l...@gmail.com wrote:
> > my concern with this approach is, how do you ensure rollover of the version number does not occur? i.e how to detect that if say a 5th rename occurs you do not corrupt or try to use the 1st which may still not be committed?
> You cannot commit a new register until its 4th prior use has completed.
> 1111 writes pending and you are issue blocked,
> otherwise you are not--one 4-input AND gate.

ok so unary encoding of what is in use, which could use a PriorityPicker (inverted, detecting zeros) to find a free slot. then yes all 1s indicates "stall"

this leaves us still with a monstrous 256x64 table of latches where:

* each column is the real (ISA) reg# and is accompanied by a suite of 4 bits (0000 meaning no allocation, up to 1111 meaning stall)
* each row is the "virtual" reg# and is also accompanied by a suite of 4 bits, which are unary set (0001, 0010, 0100, 1000)

OR

* each column is the real (ISA) reg# and is accompanied by a suite of 4 bits (0000 meaning no allocation, up to 1111 meaning stall)
* each row is the "virtual" reg#
* each cell contains 4 bits, only one of which may be set (unary) (OR 2 bits binary?)

in the latter it is easy to detect the relationship between real and virtual, including which (unary-encoded) "rename" number is given, using the cell active state (in only 1 of 4 unary bits) to "route" an incoming activated row out to a column simply with single AND gates.

including in a multi-issue context? argh no it isn't, because raising multiple rows you have no idea which column each would be associated with.

argh.

assuming you had 8 regfile ports you'd actually have to transmit 4 wires in to a given row:

* 1 to say that the lookup on that row was to be enabled
* 3 to transmit a regfile port (or other ID#) from row, to cell, and out to column.

that ID# could be an indicator to say "write to regfile port1" please.

in this way you could use the Matrix to transmit "virtual" numbered regfile read/write requests through to "real" regfile ports, even several such reads and writes simultaneously, and this is the fascinating bit, without having to actually "cache" the actual virtual reg data.

in other words it's just a translation mechanism rather than a cache.
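a toy model of that translation matrix (sizes shrunk for illustration, everything here invented): each cell holds a single bit tying a virtual row to a real column, and an activated row just gates its port ID onto that column's wires, so several rows can translate simultaneously without any CAM.

```python
N_VIRT, N_REAL = 8, 16
cell = [[0] * N_REAL for _ in range(N_VIRT)]    # 1 = row maps to column

cell[3][9] = 1                      # virtual reg 3 is currently real r9

def route(active_rows):
    """active_rows: {virtual_row: port_id} -> {real_col: port_id}."""
    out = {}
    for row, port in active_rows.items():
        for col in range(N_REAL):
            if cell[row][col]:      # one AND gate per cell does the work
                out[col] = port
    return out

# multi-issue: two rows raised at once, each carrying its own port ID
assert route({3: "wr_port1"}) == {9: "wr_port1"}
```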

l.

luke.l...@gmail.com

Oct 28, 2020, 6:44:46 PM
On Wednesday, October 28, 2020 at 10:35:44 PM UTC, Jacob Lifshay wrote:

> I haven't heard the term/acronym VVM before and wasn't able to find anything relevant by googling it, what is it?

i hadn't either, it's something that Mitch uses in the context of the Vectorisation system in his 66000 ISA, and i forgot to ask what it meant :) you *might* be able to search 66000 vector or perhaps specifically use the google groups search of comp.arch, which is, by some weird quirk, not properly linked up to google's main search engine, i have no idea why.

l.

MitchAlsup

Oct 28, 2020, 7:11:56 PM
The Virtual Vector Method is an architectural extension to the My 66000
instruction set that enables vectorization of almost any loop with exactly
2 instructions added to the instruction set.

These two instructions are like bookends {brackets?} to the vectorized code.
The first is called VEC and provides a register to HW so that exceptions can
be made precise, and a bitmap of registers which denote registers that carry
loop dependencies (from iteration k to iteration k+1.) The second instruction
is LOOP, which performs the ADD/CMP/BC semantics of typical loops and adds
the ability to perform loop exit conditions in the same instruction. The
whole mem* and str* libraries can be vectorized--except those which call
other members of that library. Almost all loops can be vectorized, AND
BECAUSE loops are vectorized (instead of instructions) the compiler does
NOT have to look for loop-carried memory dependencies (aliases) {nor many
other analyses} when deciding to vectorize a loop.

HW is allowed to run a loop iteration in a single cycle, or as many loop
iterations as it has HW to service, per cycle. In effect, HW synthesizes
SIMD calculations from serial (vonNeumann) instructions (width) while
providing guarantees to SW about how loop iterations are "performed".

HW monitors all memory references, and if one loop iteration consumes the
store from another iteration, the proper value is delivered to the proper
iteration:: this makes loops such as:

for( i = 0; i < MAX; i++ )
    a[i] = b[i]*D + b[i]*a[i-j]

vectorizable. Should j be a fairly large number so the memory reference does
not cause a stall, the loop runs at full vector performance; if j does not
have this property, the loop runs as fast as j and other data-dependencies
allow. The compiler can ASSUME that j does not cause aliasing--HW will run
the same code with or without such impediments--and perform as if the
original serial code was performed on a much "bigger" machine.
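A scalar reference for that loop (a sketch, with the i-j < 0 case guarded to zero for illustration): the a[i-j] term is the potential loop-carried dependence, and the point is that HW runs the same code correctly whether or not the alias actually bites.

```python
def loop(b, D, j, MAX):
    a = [0] * MAX
    for i in range(MAX):
        # when i - j >= 0 this reads a value produced j iterations ago:
        # small j serializes, large j leaves iterations independent
        prev = a[i - j] if i - j >= 0 else 0
        a[i] = b[i] * D + b[i] * prev
    return a

out = loop([1, 1, 1, 1], 2, 1, 4)
# a[0]=2, a[1]=2+2=4, a[2]=2+4=6, a[3]=2+6=8
assert out == [2, 4, 6, 8]
```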
>
> Jacob

MitchAlsup

Oct 28, 2020, 7:15:13 PM
Google needs lots and lots of hits before it finds anything. That amount of
hits has not yet transpired.
>
> l.

Technically it's "My 66000 ISA" and it is no longer "Mitch's" but belongs to
the world.

Should anyone want to read about it, all they need to do is request the
My 66000 Principles of Operation document and agree that if some third party
asks where those ideas came from, the finger of fate points back at me.

Terje Mathisen

unread,
Oct 29, 2020, 3:47:02 AM10/29/20
to
MitchAlsup wrote:
> On Wednesday, October 28, 2020 at 5:44:46 PM UTC-5,
> luke.l...@gmail.com wrote:
>> On Wednesday, October 28, 2020 at 10:35:44 PM UTC, Jacob Lifshay
>> wrote:
>>
>>> I haven't heard the term/acronym VVM before and wasn't able to
>>> find anything relevant by googling it, what is it?

I read it as "Virtual Vector Method", maybe Mitch calls it something else?
>>
>> i hadn't either, it's something that Mitch uses in the context of
>> the Vectorisation system in his 66000 ISA, and i forgot to ask what
>> it meant :) you *might* be able to search 66000 vector or perhaps
>> specifically use the google groups search of comp.arch, which is,
>> by some weird quirk, not properly linked up to google's main search
>> engine, i have no idea why.
>
> Google needs lots and lots of hits before it finds anything. That
> amount of hits has not yet transpired.

Google used to be _far_ more promiscuous, back in its infancy they had
about 3M hits on my name, currently (checking...) it is just 447K and a
significant part of those are pointing to other people who happen to
share my name.

It is sometimes nice to be able to hide a bit. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Stefan Monnier

unread,
Oct 29, 2020, 8:58:26 AM10/29/20
to
>>>> I haven't heard the term/acronym VVM before and wasn't able to
>>>> find anything relevant by googling it, what is it?
> I read it as "Virtual Vector Method", maybe Mitch calls it something else?

I thought it was the VolksVectorMachine: the vector machine for the people.


Stefan

EricP

unread,
Oct 29, 2020, 12:50:21 PM10/29/20
to
luke.l...@gmail.com wrote:
> On Wednesday, October 28, 2020 at 5:11:22 PM UTC, EricP wrote:
>
> eric thanks for responding.
>
>> If you've got 128 regs to play with, wouldn't it be easier to teach
>> the compiler register allocator to round-robin its allocations?
>
> turns out that this is a pain to do (and frowned on, to make a compiler
> tied to a microarchitecture's limitations)

I understand their reluctance to deal with it, but to minimize
stalls _ALL_ compilers for In-Order uArches have to deal with
extensive sets of make-and-model-specific rules for pipeline
issue scheduling and RAW, WAR and WAW dependencies.

If you start a divide DIV r9=r7/r8 it knows r9 will be busy for
such-n-such clocks. Avoiding WAW dependency should be the
easiest to handle with 128 registers.
There are also rules about what can dual or quad issue together.
Avoiding a WAR stall requires knowing the issue width because
they can't issue together, or using a different register.
Avoiding a RAW stall is likely hardest because there may
not be other instructions to slot between a writer and reader,
especially for long latency instructions.

One of the reasons to have lots of arch registers is to
allow many outstanding operations without OoO rename cost,
and allow a compiler room to maneuver.

>> One inexpensive renaming method is to just extend the
>> Architecture Register Number with some version number bits
>> to give the Physical Register Number.
>> If the ARN is 4 bits and it is extended with a 2 bit version
>> {ARN-4,VER-2} then there is a direct map from logical to
>> physical so no mapping table is required.
>> With this you could have 4 versions of each register
>> in flight without a stall occurring.

Mitch's idea is far superior (more on that below).
To be clear, I'm not suggesting this approach.
Just trying to show that some cheaper rename methods are
possible but it really is for smaller register sets.

> my concern with this approach is: how do you ensure rollover of the version number does not occur? i.e. how do you detect that if, say, a 5th rename occurs, you do not corrupt or try to use the 1st, which may still not be committed?

It's just doing WAW detection but for sets of 4.

The 2 bit version counter has 4 busy flags, one per version.
On allocate, increment counter and check if busy flag set.
If already set then stall until busy clear.
On allocate, set the busy flag
On write back, clear busy flag.

A scoreboard would already have such a busy flag for each
physical register so this just uses an expanded set of those.
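A quick Python model of the scheme as just described (the class and names are invented for the sketch; real HW would be unary latches, not objects):

```python
class RegVersions:
    """One architectural register's 2-bit version counter with one
    busy flag per version (a sketch of the scheme described above)."""
    def __init__(self, nvers=4):
        self.version = 0             # current (most recent) version
        self.nvers = nvers
        self.busy = [False] * nvers  # one busy flag per version

    def allocate(self):
        # on allocate: increment counter; if that version's busy
        # flag is still set, the caller must stall until it clears
        nxt = (self.version + 1) % self.nvers
        if self.busy[nxt]:
            return None              # stall
        self.version = nxt
        self.busy[nxt] = True
        return nxt

    def writeback(self, ver):
        # clear busy to allow the version number's reuse; the reg
        # file is only written if this is still the newest version
        self.busy[ver] = False
        return ver == self.version
```

So WAW detection for sets of 4 falls out of the busy flags: a 5th in-flight rename of the same register stalls in allocate().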

> tracking this is where i started going "um" and "err" :)
>
>> It is simple but quite inefficient and wasteful
>> as 99% of the physical registers sit unused most of the time.
>
> in a GPU which is one of the dual tasks this is for those 128 regs which will often be doing 4x4FP vectors utilisation will be pretty high, because the power penalty for spill is so drastic in a GPU workload.

How many registers in flight at once is that?
Because it still sounds like register allocation is the best solution.

>> The fly in that approach is your 128 architecture registers,
>> times 2 for the floats, times the number of versions.
>> Having two banks of 512 physical registers, mostly unused,
>> is... unappealing.
>
> indeed :)
>
>> Also when you introduce renaming, even a light version,
>> you also must consider conditional branch mispredict rollback.
>
> ok. so we're using Shadowing. so the hold-and-cancellation from the Shadow Matrices when "godie" is called would need to propagate back through to the register cache to say that the virtual reg allocation is now free for reuse, on all virtual registers allocated to all shadow-cancelled instructions, and no "damage" occurs (no rollback needed either. it means a lot of inflight RSes but that is tolerable)
>
> does that sound workable?

Sounds plausible. :-)
(I don't know enough about what you are doing to comment)

You want something concurrent, ideally 1 clock,
that doesn't have to walk serially through some lengthy sequence,
and that can work while the branch is in-flight and not have to
wait until the branch retires.

One could combine Mitch's forwarding suggestion with the version
bits and Busy flags, and forget about replicating the register files.
The arch reg# plus version bits gives a unique forwarding ID.
The Busy flags ensure an ID will only be reused after
a prior instruction for a register has written back.
So on write back an FU sends the {Reg,Version} to RegFile
which checks if the version is the same as the one writing.
If it is then this is the most recent version and it writes to RegFile[reg].
In any case, it clears its Busy flag RegFileStatus[reg].Busy[version]
to allow the version number reuse.
Listeners waiting for forwarding match the full {reg,ver} ID tag,
or watch the scoreboard RegFileStatus[reg].Busy[version] flags
for register dependents to grab forwarding data off the bus.
But it still needs cleanup on mispredict.


Jacob Lifshay

unread,
Oct 29, 2020, 3:49:56 PM10/29/20
to
On Thursday, October 29, 2020 at 9:50:21 AM UTC-7, EricP wrote:
> luke.l...@gmail.com wrote:
> > On Wednesday, October 28, 2020 at 5:11:22 PM UTC, EricP wrote:
> >> It is simple but quite inefficient and wasteful
> >> as 99% of the physical registers sit unused most of the time.
> >
> > in a GPU which is one of the dual tasks this is for those 128 regs which will often be doing 4x4FP vectors utilisation will be pretty high, because the power penalty for spill is so drastic in a GPU workload.
> How many registers in flight at once is that?
> Because it still sounds like register allocation is the best solution.

For an 8x whole-function-vectorized version of a common vertex shader, which has a 4x4xf32 matrix multiplied by a 4xf32 vector (used for transforming coordinates in 3D space), you'd get:
64 64-bit registers for the matrices and 16 64-bit registers for the vectors, and probably another 16 64-bit registers for temporaries for accumulation.

The design we're using could have 30-50 microarchitectural registers in flight at once.

luke.l...@gmail.com

unread,
Oct 30, 2020, 7:52:59 AM10/30/20
to
On Wednesday, October 28, 2020 at 9:18:27 PM UTC, MitchAlsup wrote:
> On Wednesday, October 28, 2020 at 3:52:04 PM UTC-5, luke.l...@gmail.com wrote:
> > is there any other method by which "cancel write" could be implemented which does not involve doubling or tripling the gates needed in FU-REGs?
> WaW DM instead of [RW]a[WR] DM.

my previous post contained a lot of thinking-out-loud. the key question is: if you add a WaW DM, does it cope with one WaW or can it cope with two? WaW followed by WaW *on the same register*?

for tight loops (such as the LD ADD ST example), this situation would easily occur in high-speed designs where three (or potentially more) sets of the loop's instructions were in-flight at the same time.

l.

luke.l...@gmail.com

unread,
Oct 30, 2020, 9:12:16 AM10/30/20
to
On Thursday, October 29, 2020 at 4:50:21 PM UTC, EricP wrote:
> luke.l...@gmail.com wrote:
> > On Wednesday, October 28, 2020 at 5:11:22 PM UTC, EricP wrote:
> >
> > eric thanks for responding.
> >
> >> If you've got 128 regs to play with, wouldn't it be easier to teach
> >> the compiler register allocator to round-robin its allocations?
> >
> > turns out that this is a pain to do (and frowned on, to make a compiler
> > tied to a microarchitecture's limitations)
> I understand their reticence to deal with it but to minimize
> stalls _ALL_ compilers for In-Order uArches have to deal with
> extensive sets of make-and-model specific rules for pipeline
> issue scheduling and RAW, WAR and WAW dependencies.

and for TI DSP compiler scheduling, which is 2x VLIW (e.g. left and right stereo audio processing), it's even more essential. TI's early compilers (1990s) really were suboptimal.

> If you start a divide DIV r9=r7/r8 it knows r9 will be busy for
> such-n-such clocks.

i'd say that this is more the programmer's responsibility to know that DIV would take longer, and to avoid using them excessively. if however the compiler created a WaW hazard on a DIV operation in a tight loop and the hardware couldn't cope with that then yes the consequences would be much more severe than a WaW on an ADD.

> Avoiding WAW dependency should be the
> easiest to handle with 128 registers.

wweeelll... as jacob pointed out in another post we can easily use 70 to 80% of those 128 regs for even the most mundane and routine 3D matrix calculations. and consequently get nervous about spill.

> There are also rules about what can dual or quad issue together.
> Avoiding a WAR stall requires knowing the issue width because
> they can't issue together, or using a different register.
> Avoiding a RAW stall is likely hardest because there may
> not be other instructions to slot between a writer and reader,
> especially for long latency instructions.

i hate in-order. no that's wrong: i have a feeling of hate oh and i am blaming it on in-order :)

the compromises you have to make, the complex workarounds and caveats to deal with the unsatisfactory performance... sigh

> One of the reasons to have lots of arch registers is to
> allow many outstanding operations without OoO rename cost,
> and allow a compiler room to maneuver.

in a 3D context there is an additional requirement: avoid spill entirely. the power consumption hit on parallel processing workloads that are essentially LD COMPUTE ST is so bad that it takes any processor needing spill completely out of commercial viability.

with most operations (crude rule of thumb) routinely being 4x FP32 where a standard CPU would have a scalar quantity, quadrupling the regfile depth from the normal 32 is a minimum starting point.

*after* that then yes the reasoning you suggest applies. it's just that the baseline expectation went up by a factor of 4.


> > my concern with this approach is, how do you ensure rollover of the version number does not occur? i.e how to detect that if say a 5th rename occurs you do not corrupt or try to use the 1st which may still not be committed?
> It's just doing WAW detection but for sets of 4.

ok, got it. interesting. i wonder how that would work out, under real workloads.

> The 2 bit version counter has 4 busy flags, one per version.
> On allocate, increment counter and check if busy flag set.
> If already set then stall until busy clear.
> On allocate, set the busy flag
> On write back, clear busy flag.
>
> A scoreboard would already have such a busy flag for each
> physical register so this just uses an expanded set of those.

interesting. so... it would be... you would be ganging 4 busy signals from the DMs together to produce the 1 busy back at the WaW level?

i am getting confused, trying to think it through.


> >> Also when you introduce renaming, even a light version,
> >> you also must consider conditional branch mispredict rollback.
> >
> > ok. so we're using Shadowing. so the hold-and-cancellation from the Shadow Matrices when "godie" is called would need to propagate back through to the register cache to say that the virtual reg allocation is now free for reuse, on all virtual registers allocated to all shadow-cancelled instructions, and no "damage" occurs (no rollback needed either. it means a lot of inflight RSes but that is tolerable)
> >
> > does that sound workable?
> Sounds plausible. :-)

:)

> (I don't know enough about what you are doing to comment)

standard precise-capable shadowed 6600. it means no rollback, no history, no snapshots and no transactions needed because nothing ever commits that is unsafe (it just sits in "inflight" result latches), and "cleanup" is provided automatically and implicitly through "cancellation".

ok ok yes there is the concept of rollback, but only inasmuch as inflight operations are cancelled rather than allowed to commit (irreversibly)


> So on write back an FU sends the {Reg,Version} to RegFile
> which checks if the version is the same as the one writing.

the checking, itself, is an area of concern. does the checking require a multi-way CAM? this would be incredibly expensive. i particularly want to avoid binary comparisons, especially on a table with say 32 (or 128) rows: the gate count and power consumption will be off the scale.

l.

MitchAlsup

unread,
Oct 30, 2020, 2:41:48 PM10/30/20
to
On Friday, October 30, 2020 at 6:52:59 AM UTC-5, luke.l...@gmail.com wrote:
> On Wednesday, October 28, 2020 at 9:18:27 PM UTC, MitchAlsup wrote:
> > On Wednesday, October 28, 2020 at 3:52:04 PM UTC-5, luke.l...@gmail.com wrote:
> > > is there any other method by which "cancel write" could be implemented which does not involve doubling or tripling the gates needed in FU-REGs?
> > WaW DM instead of [RW]a[WR] DM.
>
> my previous post contained a lot of thinking-out-loud. the key question is: if you add a WaW DM, does it cope with one WaW or can it cope with two? WaW followed by WaW *on the same register*?

A WaW DM could be as simple as a counter per register without any reference
to FUs, in effect concatenating register version number to each register.
>
> for tight loops (such as the LD ADD ST example), this situation would easily occur in high-speed designs where three (or potentially more) sets of the loop's instructions were in-flight at the same time.

Yes, easily. For LD/ADD/ST you are going to need LD-latency + ADD-latency +
ST-Data-In-Latency = renamed-registers/register.
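Putting made-up numbers on that formula (a 4-cycle LD, 1-cycle ADD and 1-cycle ST data-in are assumptions for illustration, not real LibreSOC latencies):

```python
def versions_in_flight(chain_latency, issue_interval=1):
    # a register rewritten once per loop iteration needs one live
    # rename per iteration in flight; iterations in flight is the
    # write-chain latency over the issue interval (ceiling division)
    return -(-chain_latency // issue_interval)
```

At one iteration issued per clock, a 4+1+1 cycle LD/ADD/ST chain needs 6 live versions of r9, which incidentally would already overflow a 2-bit (4-version) counter.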
>
> l.

MitchAlsup

unread,
Oct 30, 2020, 2:46:56 PM10/30/20
to
4×4 matrix takes 16 registers
4× vector takes 4 registers
4× accumulator takes 4 registers
---
24 registers in use for coordinate transform of a 4× vector into another 4× vector.

You hang onto the 4×4 matrix and stream 4 new registers of another vector in
while streaming out the previous 4× vector already processed.
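A trivial Python rendering (purely illustrative) of the coordinate transform those 24 registers are budgeted for:

```python
def matvec4(M, v):
    # 4x4 matrix (16 values, held resident) times a 4-vector
    # (4 values streamed in), accumulating into 4 results:
    # 16 + 4 + 4 = 24 registers live for the transform
    return [sum(M[r][c] * v[c] for c in range(4)) for r in range(4)]
```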

luke.l...@gmail.com

unread,
Oct 30, 2020, 3:48:11 PM10/30/20
to
On Friday, October 30, 2020 at 6:41:48 PM UTC, MitchAlsup wrote:
> On Friday, October 30, 2020 at 6:52:59 AM UTC-5, luke.l...@gmail.com wrote:

> > my previous post contained a lot of thinking-out-loud. the key question is: if you add a WaW DM, does it cope with one WaW or can it cope with two? WaW followed by WaW *on the same register*?
> A WaW DM could be as simple as a counter per register without any reference
> to FUs, in effect concatenating register version number to each register.

aa.. oh :) which interestingly brings us back closer to Eric's idea.

the original idea i had 18 months ago i am beginning to see was more of a full duplication of the entire FU-REGs Matrix, in effect embedding the (above) counter-register concept *on a per cell basis* to redirect (and allow full or partial rollback aka shadow-driven cancellation)

in effect this is WaW-driven register renaming but embedded in every cell rather than being external and consequently is horrendously expensive

instead of having in-cell (per-cell) WaW renaming it seems infinitely better to have virtual naming, where a completely new row is allocated to a (renamed) register.

you get exactly the same benefits, just that it takes 3-4x fewer gates.

this just leaves how to do the mapping table between virtual and real regs without going mental on CAM gatecount.

l.

luke.l...@gmail.com

unread,
Oct 31, 2020, 8:11:19 AM10/31/20
to
On Friday, October 30, 2020 at 7:48:11 PM UTC, luke.l...@gmail.com wrote:

> this just leaves how to do the mapping table between virtual and real regs without going mental on CAM gatecount.

always useful to draw it out, apologies to people used to ascii-only on newsgroups:
https://libre-soc.org/3d_gpu/isa_to_virtual_regs_table.jpg

* ISA reg numbers are in rows
* Virtual (including renamed) reg numbers are in columns
* from both the bottom *and* the right are multi-issue (unary) "Set1, Set2" signals
* where these cross they are ANDed to set a Latch in a given cell
- at no time will there be more than one cell set per row
- OR per column
* there being only one per column a "reset" signal can be per column
* for regfile port "lookup" the per-column Set1/Set2 wires may be re-used
- they may be binary or unary (as preferred)
- read and write ports are all numbered
* a "request virtual-to-real regfile port redirection" signal is along the bottom
- this is ANDed with the Latch
- that is MUXed with the per-column Set1/Set2 wires
- ANDs which accumulate in Great Big ORs fire per-row outputs

in this way any "virtual" read or write request to a register file port may be translated to a "real" request. the actual regfile data transfers on respective associated broadcast buses are external and do not need routing through this Matrix: the Matrix simply provides name translation, nothing else.
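a small python model of the matrix just described (boolean lists standing in for latches and unary wires; the class name and method names are invented for the sketch):

```python
class VirtualRegMatrix:
    """Sketch of the latch matrix in the diagram: rows are ISA regs,
    columns are virtual regs; at most one latch set per row/column."""
    def __init__(self, n_isa, n_virt):
        self.latch = [[False] * n_virt for _ in range(n_isa)]

    def set_map(self, isa_row, virt_col):
        # unary row-select ANDed with unary column-select sets one cell
        assert not any(self.latch[isa_row]), "one cell per row"
        assert not any(r[virt_col] for r in self.latch), "one per column"
        self.latch[isa_row][virt_col] = True

    def reset_col(self, virt_col):
        # only one latch per column, so reset can be per-column
        for row in self.latch:
            row[virt_col] = False

    def lookup(self, virt_col):
        # per-column port request ANDed with the latch, accumulated
        # in the Great Big ORs, fires the per-row output: translates
        # a virtual port request into its ISA-register row
        for isa_row, row in enumerate(self.latch):
            if row[virt_col]:
                return isa_row
        return None
```

name translation only, exactly as above: the data buses stay outside the matrix.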

note that if regfile port IDs are encoded in binary, an entire column can light up like a Mythbusters Christmas tree with XOR gates unless the EN is carefully arranged.

also that the re-use of the Set1/2/3 multi-issue wires for dual-purpose of regfile port enable requires that the number of wires be MAX(multi-issue, N(readports) + N(writeports))

am i missing anything?

l.

MitchAlsup

unread,
Oct 31, 2020, 12:27:32 PM10/31/20
to
On Saturday, October 31, 2020 at 7:11:19 AM UTC-5, luke.l...@gmail.com wrote:
> On Friday, October 30, 2020 at 7:48:11 PM UTC, luke.l...@gmail.com wrote:
>
> > this just leaves how to do the mapping table between virtual and real regs without going mental on CAM gatecount.
>
> always useful to draw it out, apologies to people used to ascii-only on newsgroups:
> https://libre-soc.org/3d_gpu/isa_to_virtual_regs_table.jpg

About what I was expecting. You learn well Grasshopper......

luke.l...@gmail.com

unread,
Oct 31, 2020, 12:49:20 PM10/31/20
to
On Saturday, October 31, 2020 at 4:27:32 PM UTC, MitchAlsup wrote:
> On Saturday, October 31, 2020 at 7:11:19 AM UTC-5, luke.l...@gmail.com wrote:
> > always useful to draw it out, apologies to people used to ascii-only on newsgroups:
> > https://libre-soc.org/3d_gpu/isa_to_virtual_regs_table.jpg
> About what I was expecting. You learn well Grasshopper......

i'm 50! (and 3/4) :) i'll add the renaming vector next. got the free / busy done, that's fairly obvious, AND on the Qs of 4 latches for busy. which one to set can be via a priority picker but the tricky bit is supporting multi-issue.

e.g. LD r9 followed by ADD r9, r9 followed by MUL r9, r9 would require the ISA reg vector to allocate *three* virtual regs on the same row in the same clock cycle.

ah that breaks one of the things i said about there being no overlaps on any row or column doesn't it?

that means... errr... that... drat. the WaW ordering needs to be respected, and earlier instructions on that ISA reg prevent later instructions from writing.

hmmm needs more thought.

l.

EricP

unread,
Oct 31, 2020, 4:09:53 PM10/31/20
to
luke.l...@gmail.com wrote:
> On Thursday, October 29, 2020 at 4:50:21 PM UTC, EricP wrote:
>> It's just doing WAW detection but for sets of 4.
>
> ok, got it. interesting. i wonder how that would work out, under real workloads.

Dunno yet, I'm just working through this at the same time as you.

>> The 2 bit version counter has 4 busy flags, one per version.
>> On allocate, increment counter and check if busy flag set.
>> If already set then stall until busy clear.
>> On allocate, set the busy flag
>> On write back, clear busy flag.
>>
>> A scoreboard would already have such a busy flag for each
>> physical register so this just uses an expanded set of those.
>
> interesting. so... it would be... you would be ganging 4 busy signals from the DMs together to produce the 1 busy back at the WaW level?
>
> i am getting confused, trying to think it through.
>
>
>>>> Also when you introduce renaming, even a light version,
>>>> you also must consider conditional branch mispredict rollback.
>>> ok. so we're using Shadowing. so the hold-and-cancellation from the Shadow Matrices when "godie" is called would need to propagate back through to the register cache to say that the virtual reg allocation is now free for reuse, on all virtual registers allocated to all shadow-cancelled instructions, and no "damage" occurs (no rollback needed either. it means a lot of inflight RSes but that is tolerable)
>>>
>>> does that sound workable?
>> Sounds plausible. :-)
>
> :)
>
>> (I don't know enough about what you are doing to comment)
>
> standard precise-capable shadowed 6600. it means no rollback, no history, no snapshots and no transactions needed because nothing ever commits that is unsafe (it just sits in "inflight" result latches), and "cleanup" is provided automatically and implicitly through "cancellation".
>
> ok ok yes there is the concept of rollback, but only inasmuch as inflight operations are cancelled rather than allowed to commit (irreversibly)

Ok, and the commits (result writeback) happen in order when each
FU has the oldest instruction and is ready to retire.
Which ensures precise interrupts.

>> So on write back an FU sends the {Reg,Version} to RegFile
>> which checks if the version is the same as the one writing.
>
> the checking, itself, is an area of concern. does the checking require a multi-way CAM? this wooukd be incredibly expensive. i particularly want to avoid binary comparisons especially on a table with say 32 (or 128) rows, the gate coubt and power consumption will be off the scale.
>
> l.

No, no CAM required. There is only one current version number
for each arch/physical register so just a single "==" compare.

if (RegFileStatus[resultBus.DstReg].Version == resultBus.Version)
RegFileData[resultBus.DstReg] = resultBus.Data;


I'm not sure this versioning approach would actually improve anything.

IIRC you are using valueless Reservation Stations,
meaning the RS pull their values from the reg file when
all sources operands are ready and FU issues to execute.
That means there is no place to stash the forwarded values
of alternate register versions and it has to be held in
the FU until it is the oldest and allowed to write back.
That is going to keep FU's allocated until their results
are allowed to retire at result writeback.

Expanding on your original example


I1: LD r9, ....
I2: ADD r9, r9, 100
I3: ADD r9, r9, r7
I4: ST r9,...

which could dispatch the instructions to multiple FUs,
but they would RAW-stall in the register read stage.
assume R9's 2-bit version number is initially 0

I1: LD r9, ....
decode assigns r9-1 {reg=9,ver=1} as dest virtual register
RegFileStatus[r9].Version++
I2: ADD r9, r9, 100
decode assigns r9-1 as source virtual register.
decode assigns r9-2 as dest virtual register.
RegFileStatus[r9].Version++
dispatch passes I2 to FU-Add-1
FU-Add-1 stalls because source r9-1 is Busy (pending write)
I3: ADD r9, r9, r7
decode assigns r9-2 as source virtual register.
decode assigns r9-3 as dest virtual register.
RegFileStatus[r9].Version++
dispatch passes I3 to FU-Add-2
FU-Add-2 stalls because source r9-2 is Busy (pending write)
I4: ST r9,...
decode assigns r9-3 as source virtual register.
ST FU stalls because r9-3 is Busy
...
I1 LD FU completes and writes result to r9-1.
But since current version of r9 is 3
RegFileStatus[resultBus.DstReg].Version != resultBus.Version
and reg file write of result data does NOT occur.
The result releases the allocated version 1 Busy flag
RegFileStatus[resultBus.DstReg].Busy[resultBus.Version] = 0;
Also when FU-Add-1.RS sees {resultBus.DstReg,resultBus.Version}
matches one of its operands, it pulls the data off the result bus,
pulls its other operand from RegFile, and starts execution.
...
I2 FU-Add-1 completes and writes result to r9-2.
But since the current version of r9 is 3,
the reg file write of result data does NOT occur.
The result releases the allocated version 2 Busy flag
Also when FU-Add-2.RS sees the r9-2 result
matches one of its operands, it pulls the data off the result bus,
pulls its other operand from RegFile, and starts execution.
...
I3 FU-Add-2 completes and writes result to r9-3.
Now the current version of r9 is 3
and reg file write of result data DOES occur.
The result releases the allocated version 3 Busy flag
Also when FU-ST.RS sees the r9-3 result
matches its operand, it pulls the data off the result bus
and starts execution.

It doesn't complete any faster than it would without versions.
Adding more result buses could allow multiple result writebacks,
multiple forwardings, and multiple issues per clock, but wouldn't
help this example.

Trying a WAW dependency:

I1: LD r9, ....
I2: ST r9,...
I3: ADD r9, r8, r7

I1: LD r9, ....
decode assigns r9-1 {reg=9,ver=1} as dest virtual register
RegFileStatus[r9].Version++
I2: ST r9,...
decode assigns r9-1 as source virtual register.
dispatch passes I2 to FU-ST-1
FU-ST-1 stalls because source r9-1 is Busy (pending write)
I3: ADD r9, r8, r7
decode assigns r8-0, r7-0 as source virtual register.
decode assigns r9-2 as dest virtual register.
RegFileStatus[r9].Version++
dispatch passes I3 to FU-Add-1
FU-Add-1 executes immediately because r8-0 and r7-0 are ready.
...
I3 FU-Add-1 completes but can't write back its result
because I2 has not completed and doing so would make
interrupts imprecise. I3 stalls at writeback.
...
I1 FU-LD completes and writes result to r9-1.
But current version of r9 is 2
and reg file write of result data does NOT occur.
The result releases the allocated version 1 Busy flag
RegFileStatus[resultBus.DstReg].Busy[resultBus.Version] = 0;
Also when FU-ST-1.RS sees r9-1
matches one of its operand so pulls the data off the result bus,
and pulls its other operand from RegFile and starts execution.
FU-ST-1 starts execution immediately and unblocks r9-2 writeback.
FU-Add-1 writes back its result.

Which is the same as it would have been without versioning.

So these register versions really don't get you anywhere
because they have to write back in-order anyway.
And you can't do forwarding early because the RS's are
valueless and can only pull operands when all are ready.

Does this make sense?

luke.l...@gmail.com

unread,
Oct 31, 2020, 7:28:10 PM10/31/20
to
On Saturday, October 31, 2020 at 8:09:53 PM UTC, EricP wrote:
> luke.l...@gmail.com wrote:
> > On Thursday, October 29, 2020 at 4:50:21 PM UTC, EricP wrote:
> >> It's just doing WAW detection but for sets of 4.
> >
> > ok, got it. interesting. i wonder how that would work out, under real workloads.
> Dunno yet, I'm just working through this at the same time as you.

oh cool.

> > standard precise-capable shadowed 6600. it means no rollback, no history, no snapshots and no transactions needed because nothing ever commits that is unsafe (it just sits in "inflight" result latches), and "cleanup" is provided automatically and implicitly through "cancellation".
> >
> > ok ok yes there is the concept of rollback, but only inasmuch as inflight operations are cancelled rather than allowed to commit (irreversibly)
> Ok, and the commits (result writeback) happen in order when each
> FU has the oldest instruction and is ready to retire.

funnily enough the commits don't necessarily happen in order in a straight 6600 (let alone precise-capable or multi-issue augmented) because the combination of FU-REGs (which identifies - tags - read-hazards and write-hazards) and the FU-FU (which is where the inter-FU result-order dependencies are recorded) can have batches of instructions that have absolutely nothing to do with each other: no inter-dependencies such as:

(1) MUL r1, r2
(1) ADD r2, r3
(2) MUL r10, r12
(2) ADD r12, r13

only if you *create* [unnecessary] dependencies deliberately would these commit in-order. otherwise they will result in two (mini) separate, independent Directed Acyclic Graphs.

it's one of the weird differences between scoreboards and Tomasulo: the round-robin numbering on the ROB and the requirement to move a cyclic counter on committing only in-order leaves some results (and associated RSes and their FUs) hanging in the wind.


> Which ensures precise interrupts.

ah the Shadow Matrices themselves give you precise interrupts. ok, to clarify: anything that hasn't been covered by a Shadow *CANNOT* be rolled back, period, by definition. so if you have an incoming interrupt and there are some instructions under Shadow and some not, the ones under any "Shadow" (whether it be LD/ST, predication, anything) may be pulled, and you can service the interrupt a bit quicker, but the ones that have already been given 100%-guaranteed free rein to commit? mmm no you have to wait for them. (why? because you don't actually know their issue-order, because you only recorded the R-W *dependencies* as a Directed Acyclic Graph, not the actual *instruction* order)

(you can if you like actually create a periodic otherwise-purposeless "Shadow" which is cast across absolutely everything, cancelling it periodically and re-establishing it. its sole purpose would be to give you the opportunity to drop absolutely everything on the floor in case of an ultra-high priority - NMI - interrupt).


> No, no CAM required. There is only one current version number
> for each arch/physical register so just a single "==" compare.

my concern is: if that's per arch/physical register then it's effectively a CAM (or the same cost as). if there's one "==" per arch/physical, let's say that's 4 bits, that's 4 XOR gates (plus a big-AND) and if there's 128 regs that's... mmm... i know i'm going to get corrected on this... 4x5x128 = 2560 gates.

> if (RegFileStatus[resultBus.DstReg].Version == resultBus.Version)

20 gates for the == if it's 4 bits.

> I'm not sure this versioning approach would actually improve anything.
>
> IIRC you are using valueless Reservations Stations,

yes. the reason they're valueless is because they're one-to-one connected to their Function Unit. so unlike Tomasulo (which can have a queue of RSes per FU) you don't actually need to give the RSes names (or numbers).

> meaning the RS pull their values from the reg file when
> all sources operands are ready and FU issues to execute.

actually i modified the original design to pull source operands individually (and write independently as well). it was a pig. the code complexity is horrendous (a multi-layered FSM), but hey.

> That means there is no place to stash the forwarded values
> of alternate register versions and it has to be held in
> the FU until it is the oldest and allowed to write back.

aa... aa... no, and this was precisely one of the reasons why i converted the original 6600 to what i called "MultiCompUnit" (individual source and result GO_RD/REQ and GO_WR/REQ signals), so as to be able to receive individually-forwarded source operands, rather than it be all-or-nothing.

the (appx 15th iteration) of the LD-ST CompUnit for POWER9 illustrates it:
https://libre-soc.org/3d_gpu/ld_st_comp_unit.jpg
it has 3 completely separate GO_RD/REQ signal pairs and 2 completely separate GO_WR/REQ signal pairs (POWER9 can do load/store-with-update).

or this one:
https://libre-soc.org/3d_gpu/architecture/6600scoreboard/600x-compunit_multi_rw.jpg

which if you've seen the original from Mitch's book chapters you can recognise it:
https://libre-soc.org/3d_gpu/comp_unit_req_rel.jpg

no wait... i think i might have misunderstood what you wrote. give me a sec.. nope it's ok.

* there is a place to stash the forwarded values,
* yes everything that's still got dependencies has to be held,
* no it doesn't have to wait until it's the oldest (explained above about separate DAGs),
* yes on the allowed to write back (if not shadowed)

> That is going to keep FU's allocated until their results
> are allowed to retire at result writeback.

this is just part and parcel of scoreboards (and Tomasulo). and it's not as much of a problem as it seems. the trick that Mitch came up with is: you put multiple RSes on the front of the *same* pipeline. they're considered "separate Computational Units" but they share (contend for) the one pipeline.

actually if you're using pipelines you have to do that. the minimum number of front-end RSes (aka CompUnits) needed per pipeline is the length of the pipeline itself. if you have fewer, then you have no way of tracking all the in-flight results. 3 RS front-ends on a 4-long pipeline will always, always run 25% empty, guaranteed.
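a quick sanity-check of that 25% figure, as a toy Python cycle simulation (illustrative only), assuming the best case where an RS frees up on the very cycle its op leaves the final stage:

```python
def pipeline_utilisation(n_rs, depth, cycles=1000):
    """Fraction of cycles on which a new op enters the pipeline, when an
    RS is tied up from issue until its op exits the final stage (best
    case: the RS is reusable on the exact cycle the result pops out)."""
    busy_until = [0] * n_rs            # cycle at which each RS frees up
    issued = 0
    for t in range(cycles):
        for i in range(n_rs):
            if busy_until[i] <= t:     # a free RS: issue into stage 0
                busy_until[i] = t + depth
                issued += 1
                break                  # one issue slot per cycle
    return issued / cycles

print(pipeline_utilisation(3, 4))      # 0.75: runs 25% empty
print(pipeline_utilisation(4, 4))      # 1.0: pipeline fully fed
```

a real design adds writeback contention on top of this, which can only make the shortfall worse.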

apologies, as that's quite long i'll go over the rest tomorrow, and leave you with this insight: the 6600/scoreboard system is designed around RaW and WaR hazards. if advance renaming can remove all WaW hazards then that leaves the traditional scoreboard free and clear to deal with RaW+WaR.

l.

mac

unread,
Nov 1, 2020, 11:23:24 AM11/1/20
to

>>>>> I haven't heard the term/acronym VVM before and wasn't able to

> I thought it was the VolksVectorMachine: the vector machine for the people.

Not to be confused with Wulf’s WM

--
mac the naïf

luke.l...@gmail.com

unread,
Nov 1, 2020, 11:48:22 AM11/1/20
to
On Saturday, October 31, 2020 at 8:09:53 PM UTC, EricP wrote:

ok where were we...

> Expanding on your original example

ah! examples :)

> I1: LD r8, ....
> I2: ADD r9, r9, 100
> I3: ADD r9, r8, r7
> I4: ST r9,...

ok so this example differs by adding some extra arithmetic on r9. but it is adding it to the r9-after-renaming.

errr did you mean LD r9 there? looks like you did from the I1 below.

other than an extra FU being allocated (4 in the chain rather than 3 as before) i would anticipate no difference here. one extra clock cycle in each renamed batch due to the ADD chain.

> I1: LD r9, ....
> decode assigns r9-1 {reg=9,ver=1} as dest virtual register

> It doesn't complete any faster than it would without versions.

i think, from Mitch's analysis you have to have at least 2 loops in-flight before you see any benefit.

> Adding more result buses could allow multiple result writebacks
> multiple fowardings, and multiple issues per clock but wouldn't
> help this example.

multiple forward buses, using that exclusively to pass the WaW'd result, and detecting when the in-flight data can be discarded because:
a) no other RaW source operand needing it remains and
b) it is the "overwritten" result in the WaW and
c) critically, no instruction ahead of it has a Shadow active

this is where you get the benefits: a regfile write saved for every discard, *and* regfile reads saved by using OpFwd.
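those three discard conditions can be sketched as a predicate in Python (the names and dict-based structure are assumptions for illustration, not LibreSOC's actual code):

```python
def can_discard(vreg, readers_pending, overwritten_by_newer, shadow_active):
    """Hypothetical check for dropping an in-flight result instead of
    writing it to the regfile:
      (a) no RaW consumer still needs the value,
      (b) a newer write of the same reg (WaW) supersedes it,
      (c) no instruction ahead of it still holds a Shadow
          (so cancellation can no longer roll anything back)."""
    return (not readers_pending[vreg]        # (a)
            and overwritten_by_newer[vreg]   # (b)
            and not shadow_active[vreg])     # (c)

readers_pending = {"r9-1": False, "r9-2": True}
overwritten_by_newer = {"r9-1": True, "r9-2": False}
shadow_active = {"r9-1": False, "r9-2": False}
# r9-1: forwarded to all readers and overwritten -> discard, no regfile write
# r9-2: still has a pending reader -> must be held in its output latch
```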


> Trying a WAW dependency:
>
> I1: LD r9, ....
> I4: ST r9,...
> I2: ADD r9, r8, r7

mmm... this one there is no 2nd in-instruction WaW so there is only the one WaW between the LD and the ADD.

> I1: LD r9, ....
> decode assigns r9-1 {reg=9,ver=1} as dest virtual register
> RegFileStatus[r9].Version++
> I2: ST r9,...
> decode assigns r9-1 as source virtual register.
> dispatch passes I2 to FU-ST-1
> FU-ST-1 stalls because source r9-1 is Busy (pending write)
> I3: ADD r9, r8, r7
> decode assigns r8-0, r7-0 as source virtual register.
> decode assigns r9-2 as dest virtual register.
> RegFileStatus[r9].Version++
> dispatch passes I3 to FU-Add-1
> FU-Add-1 executes immediately because r8-0 and r7-0 are ready.
> ...
> I3 FU-Add-1 completes but can't write back its result
> because I2 has not completed and doing so would make
> interrupts imprecise.

ah hang on: LD/ST has 3 characteristics:
a) AGEN (address generation)
b) determining if operation is valid (Shadow hold until determined)
c) actual operation.

if you can satisfy (b) quickly, e.g. the data is in the cache, then yes you can in fact drop the Shadow and the op can commit (when resources permit).

however even when Shadow conditions exist, OpFWDing still proceeds unimpeded, limited only by the width (number of lanes) of the OpFWD bus.

> I3 stalls at writeback.

no, the concept of stall does not apply here. the result sits in the output latch and may be OpFWDed. other FUs can continue to run ahead but do so under the Shadow of the LD/ST.

if the operation is cancelled (say we hold some NMI Shadows as described yesterday and one of them is pulled) then these output latches are wiped, but it is really important to note that the existence of held Shadows *does not* stop OpFWDing from occurring.


> Which is the same as it would have been without versioning.

the lack of impediment to OpFWDing was not taken into account, and it comes into play most notably when there are multiple loops.

Mitch noted, fascinatingly, that WaW dropping has the side-effect that literally every renamed version of r9 never actually hits the regfile except on exit from the loop, because the overwrite makes the regfile write redundant.
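a toy Python model of that observation (the counts are assumptions about the 3-instruction loop from the start of the thread, not measurements): every iteration's value travels by operand-forwarding, and only the final, live-out version of r9 is ever written to the regfile:

```python
def simulate_waw_drop(loop_iters):
    """Toy model: each iteration renames r9; a version whose value has
    been forwarded to all its readers and then overwritten (WaW) is
    dropped and never reaches the regfile.  Only the final, live-out
    version is written, on exit from the loop."""
    regfile_writes = 0
    forwarded = 0
    for i in range(loop_iters):
        # LD r9 -> ADD r9,r9,100 -> ST r9: both hand-offs go by OpFWD
        forwarded += 2
        if i == loop_iters - 1:
            regfile_writes += 1    # only the live-out value is written
        # else: overwritten by the next iteration's LD, so dropped
    return regfile_writes, forwarded

print(simulate_waw_drop(10))       # (1, 20): 1 regfile write, 20 forwards
```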

l.

MitchAlsup

unread,
Nov 1, 2020, 12:23:45 PM11/1/20
to
On Sunday, November 1, 2020 at 10:48:22 AM UTC-6, luke.l...@gmail.com wrote:
> On Saturday, October 31, 2020 at 8:09:53 PM UTC, EricP wrote:
>
> ok where were we...
>
> > Expanding on your original example
>
> ah! examples :)
>
> > I1: LD r8, ....
> > I2: ADD r9, r9, 100
> > I3: ADD r9, r8, r7
> > I4: ST r9,...
>
> ok so this example differs by adding some extra arithmetic on r9. but it is adding it to the r9-after-renaming.
>
> errr did you mean LD r9 there? looks lije you did from the I1 below.
>
> other than an extra FU being allocated (4 in the chain rather than 3 as before) i would anticipate no difference here. one extra clock cycle in each renamed batch due to the ADD chain.
>
> > I1: LD r9, ....
> > decode assigns r9-1 {reg=9,ver=1} as dest virtual register
>
> > It doesn't complete any faster than it would without versions.
>
> i think, from Mitch's analysis you have to have at least 2 loops in-flight before you see any benefit.

Compiler reuse of temporary registers also displays improvement.
Which is why VVM only writes the loop-carried dependencies at loop termination.
>
> l.

luke.l...@gmail.com

unread,
Nov 1, 2020, 2:12:54 PM11/1/20
to
On Sunday, November 1, 2020 at 5:23:45 PM UTC, MitchAlsup wrote:

> > Mitch noted, fascinatingly, that WaW dropping has the side-effect that literally every renamed version of r9 never actually hitting the regfile except on exit from the loop, because the overwrite makes the regwrite redundant.
> Which is why VVM only writes the loop-carried dependencies at loop termination.

the nice thing about VVM (which i now grok, after 18+ months) is that it simply recognises - mirrors - this microarchitectural optimisation of WaW-aware OoO, *but* it can fall back, should there not be enough in-flight latches, to non-WaW OpFWDing, and likely works just fine in-order.

i suspect then that VVM can be "talked about" in abstract terms independent of microarchitecture, where the OpFWD/WaW variant is simply an "ultimate optimal" implementation.


onward with the analysis at hand. carrying on from the insight shared with Eric, that after WaW rename we are left with pure RaW and WaR, which scoreboards handle with ease. note in passing that the original 6600 *detected* the WaW condition by comparing the current WritePending vector with the reg# converted to unary. a match would stall issue until the relevant write hazards had cleared. this is the "safe, less complex, less optimal" implementation route.
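that unary check can be sketched in a couple of lines of Python (the structure is assumed from the description above, not taken from actual 6600 netlists):

```python
def write_pending_stall(write_pending, dest_reg):
    """Sketch of the original 6600's WaW check at issue: the destination
    register number is converted to a one-hot (unary) vector and ANDed
    with the current write-pending vector; any match stalls issue until
    the earlier write has cleared."""
    dest_unary = 1 << dest_reg           # reg# -> one-hot bit
    return bool(write_pending & dest_unary)

# r9 already has a write in flight; a second writer of r9 stalls at issue
pending = 1 << 9
print(write_pending_stall(pending, 9))   # True  -> stall
print(write_pending_stall(pending, 8))   # False -> issue proceeds
```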

one absolutely crucial thing i realised is that WaWs have to also respect order, just like the DAGs (Directed Acyclic Graphs) created on RaW and WaR. i thought for a day about how that could be implemented and then realised that actually it is exactly like Shadowing.

so... umm... why not actually use a Shadow Matrix?

the number of columns would be our Virtual Regs (we have mapped to VRegs aka renamed-reg#s by now) and the number of rows would be the "rename point count" (the number of times a reg is permitted to be renamed) that Eric and Mitch mentioned.

Shadow Matrices are already set up to record transitive relationships, so a first rename WaW would cover the second, and the third, the second would cover the third.

the only thing that is distinctly odd is that the WaW Shadowing is, i think, *backwards* compared to normal Shadowing. or, at least: definitely not the same behaviour.

look at e.g. a LDST Shadow. when the LDST unit knows that the memory access is 100% guaranteed to succeed it drops the shadow against itself *and all downstream (later-issued)* instructions.

however in the WaW-Shadowed case the control is BACKWARDS. it is the NEWEST (later-issued) instruction that, when it is free to commit, signals back down the WaW chain (if there is one), to the renamed reg it is overwriting, that the predecessor can "drop" itself safely without data corruption.
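a toy Python sketch of that backwards release (the class and its names are assumptions for illustration): the newest version, on becoming free to commit, walks back down the chain and marks every older version droppable:

```python
class WaWChain:
    """Toy model of the 'backwards' WaW shadow: each renamed version of a
    register shadows its predecessors, and it is the *newest* write that,
    on being free to commit, releases the older versions, letting them
    drop their (now dead) results without a regfile write."""
    def __init__(self, versions):
        self.versions = versions                     # oldest .. newest
        self.droppable = {v: False for v in versions}

    def newest_commits(self):
        # the release signal flows backwards down the chain:
        # every version older than the newest may now drop itself
        for v in self.versions[:-1]:
            self.droppable[v] = True

chain = WaWChain(["r9-1", "r9-2", "r9-3"])
chain.newest_commits()
# r9-1 and r9-2 become droppable; r9-3 (the live-out value) is written
```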

still on a path to working this out. diagrams to follow which may make this clearer.

l.

MitchAlsup

unread,
Nov 1, 2020, 3:54:35 PM11/1/20
to
On Sunday, November 1, 2020 at 1:12:54 PM UTC-6, luke.l...@gmail.com wrote:
> On Sunday, November 1, 2020 at 5:23:45 PM UTC, MitchAlsup wrote:
>
> > > Mitch noted, fascinatingly, that WaW dropping has the side-effect that literally every renamed version of r9 never actually hitting the regfile except on exit from the loop, because the overwrite makes the regwrite redundant.
> > Which is why VVM only writes the loop-carried dependencies at loop termination.
>
> the nice thing about VVM (which i now grok, after 18+ months) is that it simply recognises - mirrors - this microarchitectural optimisation of WaW-aware OoO, *but*, that it can fall-back, should there not be enough in-flight latches, to non-WaW OpFWDing and likely works just fine on in-order.

We generally categorize instruction processing systems as either In-Order or
Out-Of-Order, VVM sits in the OoO camp, but has both feet in the restricted-
partial-order (generally unrecognized) camp. A minimal-partial-order still
retains enough in-orderness that precise exceptions can be performed without
"heroic" methods {such as reorder buffers, and the like}.
>
> i suspect then that VVM can be "talked about" in abstract terms independent of microarchitecture, where the OpFWD/WaW variant is simply an "ultimate optimal" implementation.

NNM in the abstract is a set of rules whereby a loop can have a header
and trailer placed between otherwise scalar code, and whereby the HW
is allowed to bring as much processing capability to the job at hand
without losing the scalar qualities {such as precise exception on the
exact instruction/loop count/... where the exception occurred.}

These rules allow for (but do not mandate) SIMD execution when register
and memory dependencies allow multiple lanes of execution.
>
>
> onward with the analysis at hand. carrying on from the insight shared with Eric, that after WaW rename we are left eith pure RaW and WaR which scoreboards handle with ease. note in passing that the original 6600 *detected* the WaW condition by comparing the current WritePending vector with the reg# converted to unary. a match would stall issue until the relevant write hazards had cleared. this is the "safe, less complex, less optimal" implementation route.

Remember at this time the average machine was at 0.1 IPC heading towards 0.2
IPC.
>
> one absolutely crucial thing i realised is that WaWs have to also respect order, just like the DAGs (Directed Acyclic Graphs) created on RaW and WaR. i thought for a day about how that could be implemented and then realised that actuslly it is exactly like Shadowing.

Bingo--grasshopper......
>
> so... umm... why not actually use a Shadow Matrix?
>
> the number of columns would be our Virtual Regs (we have mapped to VRegs aka renamed-reg#s by now) and the number of rows would be the "rename point count" (the number of times a reg is permitted to be renamed) that Eric and Mitch mentioned.
>
> Shadow Matrices are already set up to record transitive relationships, so a first rename WaW would cover the second, and the third, the second would cover the third.
>
> the only thing that is distinctly odd is that the WaW Shadowing is, i think, *backwards* compared to normal Shadowing. or, at least: definitely not the same behaviour.

It is backwards because it is trying to enable you to avoid writing something
rather than enabling some unit of forward progress. By enabling a register
write to be dropped, you imply something ELSE can now make forward progress!
That is why the arcs flow backwards.

luke.l...@gmail.com

unread,
Nov 1, 2020, 6:04:44 PM11/1/20
to
On Sunday, November 1, 2020 at 8:54:35 PM UTC, MitchAlsup wrote:

> NNM in the abstract is

Nanu-Nanu Machine??

> > the only thing that is distinctly odd is that the WaW Shadowing is, i think, *backwards* compared to normal Shadowing. or, at least: definitely not the same behaviour.
> It is backwards because it is trying to enable you to avoid writing something
> rather than enabling some unit of forward progress. By enabling a register
> write to be dropped, you imply something ELSE can now make forward progress!

interesting.

> That is why the arcs flow backwards.

... you've done this before!

i'm currently trying to envisage how this would work in multi-issue. how that would preserve the instruction issue order when, in the same batch there are multiple WaWs on the same register.

with WaR and RaW i recall the neat transitive trick of adding extra dependencies between instructions in the same batch.

is there something similar for WaW or is it better to use a cyclic counter and work it out from there?

l.

MitchAlsup

unread,
Nov 1, 2020, 6:37:21 PM11/1/20
to
On Sunday, November 1, 2020 at 5:04:44 PM UTC-6, luke.l...@gmail.com wrote:
> On Sunday, November 1, 2020 at 8:54:35 PM UTC, MitchAlsup wrote:
>
> > NNM in the abstract is
>
> Nanu-Nanu Machine??
>
> > > the only thing that is distinctly odd is that the WaW Shadowing is, i think, *backwards* compared to normal Shadowing. or, at least: definitely not the same behaviour.
> > It is backwards because it is trying to enable you to avoid writing something
> > rather than enabling some unit of forward progress. By enabling a register
> > write to be dropped, you imply something ELSE can now make forward progress!
>
> iinteresting.
>
> > That is why the arcs flow backwards.
>
> ... you've done this before!
>
> i'm currently trying to envisage how this would work in multi-issue. how that would preserve the instruction issue order when, in the same batch there are multiple WaWs on the same register.

For reasonable widths, you simply compare every destination register.
{You more or less have to do this anyway to make sure the dependencies
are tracked properly.}

For wider than "reasonable widths" it might be best to organize the
instruction cache into packets and allow instructions from non-sequential
locations to be packed together--this allows all the annotations to be done
once not every time instructions are issued into execution.
>
> with WaR and RaW i recall the neat transitive trick of adding extra dependencies between insteuxtions in the same batch.
>
> is there something similar for WaW or is it better to use a cyclic counter and work it out from there?

What kind of width are we talking (2-4) or wider ?
>
> l.

MitchAlsup

unread,
Nov 1, 2020, 6:38:09 PM11/1/20
to
On Sunday, November 1, 2020 at 5:04:44 PM UTC-6, luke.l...@gmail.com wrote:
> On Sunday, November 1, 2020 at 8:54:35 PM UTC, MitchAlsup wrote:
>
> > NNM in the abstract is
>
> Nanu-Nanu Machine??

Just think of it like VVM except both Vs have a crutch and a list.

luke.l...@gmail.com

unread,
Nov 1, 2020, 7:22:35 PM11/1/20
to
On Sunday, November 1, 2020 at 11:37:21 PM UTC, MitchAlsup wrote:

> > i'm currently trying to envisage how this would work in multi-issue. how that would preserve the instruction issue order when, in the same batch there are multiple WaWs on the same register.
> For reasonable widths, you simply compare every destination register.
> {You more or less have to do this anyway to make sure the dependencies
> are tracked properly.}

simply not having any information to go on at this stage i am guessing a starting point of up to 4 renames per ISAreg before saying "ok that's enough" and stalling.

> For wider than "reasonable widths" it might be best to organize the
> instruction cache into packets and allow instructions from non-sequential
> locations to be packed together--this allows all the annotations to be done
> once not every time instructions are issued into execution.

we're planning something slightly different because i'm not comfortable with the monster size of the FU-REGs and FU-FU tables otherwise (of the order of 250,000 gates).

with 128 FP64, 128 INT64 and (under discussion) 128 4-bit Condition Registers (welcome to vectorisation of OpenPOWER...) the plan is to have a full (non-sparse) arrangement for the "standard" scalar PowerISA regs (0-31) and to do a "sparse" matrix a la "lane banks" for regs 32-127.

thus if vector operations produce elements where the ISAregs numbers are modulo 4 they go through the top (sparse) renaming *and* FU-REGs *and* FU-FU, otherwise they go through the "scalar" route.

i'd like to avoid additional instruction cache complexity in this particular design because it is a "tagged" (Sub-VL) architecture where scalar instructions are tagged as "vectorised" with a prefix, which indicates "this scalar instruction is actually a vector instruction, please shove a sequential batch of instructions directly into the multi-issue execution engine, from 0 to VL-1 where VL=vectorlength"


> > with WaR and RaW i recall the neat transitive trick of adding extra dependencies between insteuxtions in the same batch.
> >
> > is there something similar for WaW or is it better to use a cyclic counter and work it out from there?
> What kind of width are we talking (2-4) or wider ?

due to the way that the vectorisation overloads multi-issue we need 4 to 8 for the first main commercial version (modern smartphone / tablet processor, around the 2.5 to 3.5 watts mark) however when we get to Intel / Radeon / NVidia scale (30 to 100W) the idea is to go non-uniform (big/little) with the main (fast) cores doing 4-way and the "GPU" cores (still OpenPOWER ISA compatible) doing 8 or even 16 way to give 8x or 16x onto 2xFP32 SIMD FUs, to give 16 or 32 FP32 FMACs per clock. little monsters with performance comparable to POWER10 in other words.

bottom line we need to plan for scaling up to horrendous-way multi-issue, although at that point (16x) we may have to do things differently anyway. to stay sane or at least keep the maniacal laughter out the picture let's say 4-8.

l.

luke.l...@gmail.com

unread,
Nov 1, 2020, 8:00:28 PM11/1/20
to
ha, hilarious. you'll like this. what scheme works really well to track multi-issue dependencies by using transitive relationships?

surpriiise: FU-REGs when combined with FU-FU.

therefore it should come as no surprise that an identical scheme should work for multi-issue WaW.

* rather than comparing renamed regs against each other, have a bit-level matrix recording (including transitive relationships) VREGs-VREGs similar to FU-FU.
* ISAregs to VREGs is similar to FU-REGs and, once again, Great Big OR gates create in this case a WAW_pending line per row that fires into VREGs-VREGs.
* from the instruction order the transitive relationships can be created to indicate a WaW, WaWaW, WaWaWaW.
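the construction of that matrix from issue order alone can be sketched in Python (indices here stand in for the unary rows/columns; illustrative only):

```python
def build_waw_matrix(dest_seq):
    """Sketch of a bit-level VREG-to-VREG WaW matrix (analogous to FU-FU):
    for each destination written more than once in a batch, row[newer]
    sets a bit against every older write of the same reg, so the chain
    WaW, WaWaW, WaWaWaW is recorded transitively from issue order."""
    n = len(dest_seq)
    matrix = [[0] * n for _ in range(n)]
    for newer in range(n):
        for older in range(newer):
            if dest_seq[older] == dest_seq[newer]:
                matrix[newer][older] = 1     # newer write shadows older
    return matrix

# batch of 4: I0 writes r9, I1 writes r5, I2 writes r9, I3 writes r9
m = build_waw_matrix([9, 5, 9, 9])
# m[2][0], m[3][0] and m[3][2] are set: the full WaWaW chain on r9
```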

ok back to sleep now.

l.


EricP

unread,
Nov 1, 2020, 9:49:15 PM11/1/20
to
luke.l...@gmail.com wrote:
> On Saturday, October 31, 2020 at 8:09:53 PM UTC, EricP wrote:
>>> standard precise-capable shadowed 6600. it means no rollback, no history, no snapshots and no transactions needed because nothing ever commits that is unsafe (it just sits in "inflight" result latches), and "cleanup" is provided automatically and implicitly through "cancellation".
>>>
>>> ok ok yes there is the concept of rollback, but only inasmuch as inflight operations are cancelled rather than allowed to commit (irreversibly)
>> Ok, and the commits (result writeback) happen in order when each
>> FU has the oldest instruction and is ready to retire.
>
> funnily enough the commits don't necessarily happen in order in a straight 6600 (let alone precise-capable or multi-issue augmented) because the combination of FU-REGs (which identifies - tags - read-hazards and write-hazards) and the FU-FU (which is where the inter-FU result-order dependencies are recorded) can have batches of instructions that have absolutely nothing to do with each other: no inter-dependencies such as:

Yes, that's what I thought looking at the CDC 6600 logic.
To my eye, a consequence is that loads and stores can be out of order.
But its been a while since I looked at it.

> (1) MUL r1, r2
> (1) ADD r2, r3
> (2) MUL r10, r12
> (2) ADD r12, r13
>
> only if you *create* [unnecessary] dependencies deliberately would these commit in-order. otherwise they will result in two (mini) separate, independent Directed Acyclic Graphs.
>
> it's one of the weird differences between scoreboards and Tomasulo: the round-robin numbering on the ROB and the requirement to move a cyclic counter on committing only in-order leaves some results (and associated RSes and their FUs) hanging in the wind.
>
>
>> Which ensures precise interrupts.
>
> ah the Shadow Matrices themselves give you precise interrupts. ok, to clarify: anything that hasn't been covered by a Shadow *CANNOT* be rolled back, period, by definition. so if you have an incoming interrupt and there are some instructions under Shadow and some not, the ones under any "Shadow" (whether it be LD/ST, predication, anything) may be pulled, and you can service the interrupt a bit quicker, but the ones that have already been given 100%-guaranteed free rein to commit? mmm no you have to wait for them. (why? because you don't actually know their issue-order, because you only recorded the R-W *dependencies* as a Directed Acyclic Graph, not the actual *instruction* order)
>
> (you can if you like actually create a periodic otherwise-purposeless "Shadow" which is cast across absolutely everything, cancelling it periodically and re-establishing it. its sole purpose would be to give you the opportunity to drop absolutely everything on the floor in case of an ultra-high priority - NMI - interrupt).

In this context I include exceptions with interrupts.
So all loads and stores, floats, and integer divide can trap.
You can only write back to a register when you know
that no older instruction can trap otherwise the
reordering effects become visible.

In other words, you can only commit the results to registers in order.
However forwarding can take place in any order.

Furthermore, because your RS are valueless, all operands have
to be ready and forwarded at the same time. So that means all
FU have to retain their results until all dependents have started.

If there are multiple dependents that might mean multiple
transmissions of each result. Or you can only proceed when
all operands of all dependents are ready together.
Which means a huge N*M forwarding matrix.
(Which I see below you send operands one by one so that
eliminates this potential problem).

>> No, no CAM required. There is only one current version number
>> for each arch/physical register so just a single "==" compare.
>
> my concern is: if that's per arch/physical register then it's effectively a CAM (or the same cost as). if there's one "==" per arch/physical, let's say that's 4 bits, that's 4 XOR gates (plus a big-AND) and if there's 128 regs that's... mmm... i know i'm going to get corrected on this... 4x5x128 = 2560 gates.
>
>> if (RegFileStatus[resultBus.DstReg].Version == resultBus.Version)
>
> 20 gates for the == if it's 4 bits.

Well, you have 128 registers, so the result data bus has to drive
128 loads, and the version wires have to drive 128 NAND loads.
The register# 7:128 decode AND's the version bits which
XNOR with that register version bits, and the XNORS AND
to enable the register data to clock in.

regEnable = Decode7to128(resultBus.DstReg);
doWrite   = AND(XNOR(AND(resultBus.Version[0], regEnable)),
                XNOR(AND(resultBus.Version[1], regEnable)));
if (doWrite)
    RegFileData[resultBus.DstReg] <= resultData;

>> I'm not sure this versioning approach would actually improve anything.
>>
>> IIRC you are using valueless Reservations Stations,
>
> yes. the reason they're valueless is because they're one-to-one connected to their Function Unit. so unlike Tomasulo (which can have a queue of RSes per FU) you don't actually need to give the RSes names (or numbers).
>
>> meaning the RS pull their values from the reg file when
>> all sources operands are ready and FU issues to execute.
>
> actually i modified the original design to pull source operands individually (and write independently as well). it was a pig. the code complexity is horrendous (a multi-layered FSM), but hey.

So when all operands are ready they get pulled one by one
and the uOp executes. If those operands are held in FU's then
they come from a forwarding bus rather than the RegFile.

Ok, that eliminates the need to forward everything at once.

>> That means there is no place to stash the forwarded values
>> of alternate register versions and it has to be held in
>> the FU until it is the oldest and allowed to write back.
>
> aa... aa... no, and this was precisely one of the reasons why i converted the original 6600 to what i called "MultiCompUnit" (individual source and result GO_RD/REQ and GO_WR/REQ signals), so as to be able to receive individually-forwarded source operands, rather than it be all-or-nothing.
>
> the (appx 15th iteration) of the LD-ST CompUnit for POWER9 illustrates it:
> https://libre-soc.org/3d_gpu/ld_st_comp_unit.jpg
> it has 3 completely separate GO_RD/REQ signal pairs and 2 completely separate GO_WR/REQ signal pairs (POWER9 can do load/store-with-update).
>
> or this one:
> https://libre-soc.org/3d_gpu/architecture/6600scoreboard/600x-compunit_multi_rw.jpg
>
> which if you've seen the original from Mitch's book chapters you can recognise it:
> https://libre-soc.org/3d_gpu/comp_unit_req_rel.jpg

Sadly Mitch's book was too big for my ISP mail system (limit of 10 MB).

> no wait... i think i might have misunderstood what you wrote. give me a sec.. nope it's ok.
>
> * there is a place to stash the forwarded values,
> * yes everything that's still got dependencies has to be held,
> * no it doesn't have to wait until it's the oldest (explained above about separate DAGs),
> * yes on the allowed to write back (if not shadowed)

Ok, this is the multiple forwardings I mentioned above.

>> That is going to keep FU's allocated until their results
>> are allowed to retire at result writeback.
>
> this is just part and parcel of scoreboards (and Tomasulo). and it's not as much of a problem as it seems. the trick that Mitch came up with is: you put multiple RSes on the front of the *same* pipeline. they're considered "separate Computational Units" but they share (contend for) the one pipeline.
>
> actually if you're using pipelines you have to do that. the minimum number of front-end RSes (aka CompUnits) needed per pipeline is the length of the pipeline itself. if you have less, then you have no way of tracking the results. 3 RS front-ends on a 4-long pipeline will always, always run 25% empty, guaranteed.

It depends on the latency that an RS can be reused.

Maybe we're using the terminology differently.
To me, RS's hold pending uOp operations.
If a FU is pipelined, like an FPU, when the RS operands are all ready
it hands of to the pipeline and are available for subsequent operations.
The uOp and its operand values propagate down the pipeline stages.

So if one has 3 RS on a function unit then one can park 3 pending
uOps in them, which allow the Dispatcher to continue handing off
uOps from the front end to back end.

If an FPU has 1 RS, and it issued and was available for the very
next cycle, and Dispatch had another FPU uOp ready to hand off,
and the FU scheduler was able to determine the operand ready
status and schedule it for the next clock,
then a single RS could feed a pipeline on back to back clocks.

> apologies, as that's quite long i'll go over the rest tomorrow, and leave you with this insight: the 6600/scoreboard system is designed around RaW and WaR hazards. if advance renaming can remove all WaW hazards then that leaves the traditional scoreboard free and clear to deal with RaW+WaR.
>
> l.

Yes.


luke.l...@gmail.com

unread,
Nov 2, 2020, 7:30:21 AM11/2/20
to
On Monday, November 2, 2020 at 2:49:15 AM UTC, EricP wrote:
> luke.l...@gmail.com wrote:
> > funnily enough the commits don't necessarily happen in order in a straight 6600 (let alone precise-capable or multi-issue augmented) because the combination of FU-REGs (which identifies - tags - read-hazards and write-hazards) and the FU-FU (which is where the inter-FU result-order dependencies are recorded) can have batches of instructions that have absolutely nothing to do with each other: no inter-dependencies such as:
> Yes, that's what I thought looking at the CDC 6600 logic.
> To my eye, a consequence is that loads and stores can be out of order.

yes, and Mitch describes a LD-ST Dep-Matrix system in his book chapters to take care of that. if you email me (lk...@lkcl.net) we can work out a way to get you those book chapters (private ftp download rather than send-by-email). i take it you're happy to agree to Mitch's conditions of giving him credit if you use anything in them?


> > ah the Shadow Matrices themselves give you precise interrupts.
> In this context I include exceptions with interrupts.
> So all loads and stores, floats, and integer divide can trap.
> You can only write back to a register when you know
> that no older instruction can trap otherwise the
> reordering effects become visible.

this instruction-order-preservation indeed occurs, however as a sort-of side-effect of how Shadow Matrices work. once a Shadow is enabled it continues to be cast across all other (newer) instructions until such time as it retires or cancels. that *includes* over other additionally-shadowed instructions, which OR their "holds" together. so a LD shadow across an instruction is ORed with a... FP-div shadow.
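that ORing of holds can be sketched in a few lines of Python (the rows and columns here are illustrative, not a real matrix layout):

```python
def commit_allowed(shadow_matrix, fu):
    """Sketch of the hold logic described above: a Function Unit's result
    may commit only once the OR of every shadow bit cast over it (LD/ST,
    FP-div, predication, ...) has cleared."""
    return not any(row[fu] for row in shadow_matrix)

# rows = shadow-casters (an in-flight LD, an FP-div); cols = shadowed FUs
shadows = [
    [0, 1, 1],   # LD shadow still cast over FU1 and FU2
    [0, 0, 1],   # FP-div shadow still cast over FU2
]
print(commit_allowed(shadows, 0))   # True:  no shadow -> free to commit
print(commit_allowed(shadows, 2))   # False: the two holds OR together
```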

> In other words, you can only commit the results to registers in order.

not quite: whilst the Shadow Matrices preserve the instruction order, once released (once a batch of instructions are free and clear of *all* Shadows) those instructions (their results) are free and clear to commit in any order (because by definition they're all guaranteed to commit 100% successfully, being free of hazards, so why add unnecessary logic proscribing their order?)

summary: the *prevention* is in-order but the [shadow-free] commits are not.
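to make the ORed-holds behaviour concrete, here is a minimal plain-Python sketch (illustrative names only, not LibreSOC's actual nmigen HDL): each in-flight instruction may be shadowed by several still-speculative older instructions, the holds are simply ORed, and once *all* shadows drop, commit order for the survivors is unconstrained.

```python
# Sketch: shadow "holds" are ORed; commit is blocked while ANY shadow is up.
class Instruction:
    def __init__(self, name):
        self.name = name
        self.shadows = set()   # older instructions still able to cancel us

    def add_shadow(self, caster):
        self.shadows.add(caster)

    def resolve(self, caster):
        # caster retired successfully (or was cancelled): drop its hold
        self.shadows.discard(caster)

    @property
    def may_commit(self):
        # OR of all holds: blocked while any shadow remains
        return not self.shadows

add = Instruction("ADD")
add.add_shadow("LD-may-fault")
add.add_shadow("FPDIV-may-trap")
assert not add.may_commit
add.resolve("LD-may-fault")
assert not add.may_commit      # still held: the holds are ORed
add.resolve("FPDIV-may-trap")
assert add.may_commit          # free and clear: commit order now unconstrained
```

once `may_commit` goes true the result is guaranteed to write back successfully, which is exactly why no further ordering logic is needed.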

> However forwarding can take place in any order.

yes, definitely.

> Furthermore, because your RS are valueless,

"nameless" i call it. or "the RS's input latches" and "the RS's output latches".

> all operands have
> to be ready and forwarded at the same time. So that means all
> FU have to retain their results until all dependents have started.

yes.

> If there are multiple dependents that might mean multiple
> transmissions of each result.

yes. the context here is that all this time we're under Shadow conditions (so cannot commit to regfile), so results are being Operand-Forwarded, multiple times. potentially broadcast, with multiple dependents receiving at the same time, which would be nice (but might overload the buffer line drivers, have to watch out for that).

> Or you can only proceed when
> all operands of all dependents are ready together.

yuck :)

> Which means a huge N*M forwarding matrix.

if you only have one GO_READ/REQ flag per CompUnit and multiple source operands you need much smaller Dependency Matrices. however the probability of both operands happening to be simultaneously available via OpFWDing is so slim that i considered it a waste of time. hence going for individual operand GO_READ/REQ lines.

> (Which I see below you send operands one by one so that
> eliminates this potential problem).

makes the Dep Matrices enormous, but hey.

> Well, you have 128 registers, so the result data bus has to drive
> 128 loads, and the version wires have to drive 128 NAND loads.

sigh. i'm hoping that buffers will take care of that. only one of them will be active (you don't broadcast the result into multiple registers at once) so i think we're good, there.

> > actually i modified the original design to pull source operands individually (and write independently as well). it was a pig. the code complexity is horrendous (a multi-layered FSM), but hey.
> So when all operands are ready they get pulled one by one
> and the uOp executes.

yes. or, more accurately: if the regfile is 4R1W and there are 3 operands, up to 3 independent GO_RD/REQs will result in 3 of those 4R being activated on the same clock. so they get pulled *independently* one by one.

this also solves the problem of what to do if the instruction writes to 2 regfiles (OpenPOWER ISA load-with-update). with the regfile being only 4R1W the two writes happen in successive clock cycles rather than going "argh argh we need to write 2 values simultaneously and there's only one regfile port, argh".
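the "two writes drain over successive clocks" behaviour can be sketched in a few lines of Python (function name and values are made up for illustration; the real thing is an nmigen FSM):

```python
# Sketch: serialising multiple destination writes (e.g. OpenPOWER
# load-with-update, which writes both RT and RA) through a regfile
# with only a single write port: writes simply drain one per clock.
def drain_writes(pending, ports=1):
    """Return, per clock, the subset of pending (reg, value) writes issued."""
    schedule = []
    while pending:
        this_clock, pending = pending[:ports], pending[ports:]
        schedule.append(this_clock)
    return schedule

# ldu-style op: loaded data to r9, updated effective address to r3
sched = drain_writes([("r9", 0x1234), ("r3", 0x8008)])
assert len(sched) == 2                  # two successive clock cycles
assert sched[0] == [("r9", 0x1234)]
assert sched[1] == [("r3", 0x8008)]
```

with two write ports the same call would complete in one clock, which is the trade-off being avoided here.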

> If those operands are held in FU's then
> they come from a forwarding bus rather than the RegFile.

yes.

> Ok, that eliminates the need to forward everything at once.

:)

> > which if you've seen the original from Mitch's book chapters you can recognise it:
> > https://libre-soc.org/3d_gpu/comp_unit_req_rel.jpg
> Sadly Mitch's book was too big for my ISP mail system (limit of 10 MB).

hence why i took snippets of diagrams. email me, i can sort that out for you.

> > this is just part and parcel of scoreboards (and Tomasulo). and it's not as much of a problem as it seems. the trick that Mitch came up with is: you put multiple RSes on the front of the *same* pipeline. they're considered "separate Computational Units" but they share (contend for) the one pipeline.
> >
> > actually if you're using pipelines you have to do that. the minimum number of front-end RSes (aka CompUnits) needed per pipeline is the length of the pipeline itself. if you have less, then you have no way of tracking the results. 3 RS front-ends on a 4-long pipeline will always, always run 25% empty, guaranteed.
> It depends on the latency that an RS can be reused.
>
> Maybe we're using the terminology differently.

probably :) Ivan and i had a week-long discussion about terminology a few months ago; it was quite involved (very grateful to you, Ivan, for sticking it out). another contributor here had a view of "Function Unit" that was too all-encompassing, and it was causing huge confusion.

the term "Phase-aware Function Unit" was what came out of that discussion, and it's a term that *does not exist in Industry-standard usage*, right across the board. it means, "a Function Unit - whether that be a pipeline or a FSM - that tracks its operands and results categorically without fail 100%, either solely in hardware *or through information preserved and tracked by the COMPILER*".

("Phase" as in the context of "Phase 1 of the building development")

> To me, RS's hold pending uOp operations.
> If a FU is pipelined, like an FPU, when the RS operands are all ready
> it hands off to the pipeline and are available for subsequent operations.
> The uOp and its operand values propagate down the pipeline stages.

pipelines themselves typically do not "track" their results. in some designs you have a parallel "timing chain" which preserves that information, externally, about when the result is going to "pop out" the other end. the COMBINATION of that timing chain with the pipeline (FU) is what i termed a "Phase-aware FU".

i cannot quite believe that in 50 years of advanced computer science with billions spent by Corporations and Universities, absolutely nobody has come up with this term such that it is in common usage today, but hey.

so it is the COMBINATION of the RSes at the front of a pipeline (FU) which make a Phase-aware FU.

> So if one has 3 RS on a function unit then one can park 3 pending
> uOps in them,

yes. and you had better not have more than 3 stages in that function unit (if it's a pipeline) because when those RSes are 100% full there *is* no room at the FUs to put a 4th piece of data into the pipeline. why? because the fundamental rule is: you absolutely, categorically must track data, respecting its dependencies, 100% of the time.

> which allow the Dispatcher to continue handing off
> uOps from the front end to back end.

and it's the *RSes'* job to "track" the context, allowing the result, when it pops back out the end, to be "reassociated" (routed) back to the correct one of those 3 RS "output-capturing latches"

*not* the pipelines.

in LibreSOC we do this by actually passing the RS# down through the pipeline stages, along with the intermediary data. that "context" progresses in lock-step and, on popping out the end, is used as a Routing/Muxing ID to get the result back to the RS it belongs to.
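that lock-step RS# context can be sketched as a plain-Python model (class and field names are illustrative, not the LibreSOC HDL): the RS number rides along with the data, and on exit it acts as the mux/routing ID back to the owning RS's output latch.

```python
# Sketch: a 3-deep pipeline where the RS# travels with the data and,
# on popping out the end, routes the result to the right RS latch.
PIPE_DEPTH = 3

class PhaseAwareFU:
    def __init__(self, n_rs=PIPE_DEPTH):
        self.stages = [None] * PIPE_DEPTH    # each slot: (rs_id, operand) or None
        self.rs_result = [None] * n_rs       # RS output-capture latches

    def tick(self, issue=None):
        out = self.stages.pop()              # oldest entry exits the pipe
        if out is not None:
            rs_id, value = out
            self.rs_result[rs_id] = value + 1  # stand-in for real computation
        self.stages.insert(0, issue)         # newest entry enters stage 0

fu = PhaseAwareFU()
fu.tick(issue=(0, 10))   # RS#0 feeds the pipe
fu.tick(issue=(2, 20))   # RS#2 feeds the pipe one clock later
fu.tick()
fu.tick()                # RS#0's result pops out, routed by its travelling rs_id
assert fu.rs_result[0] == 11
fu.tick()
assert fu.rs_result[2] == 21
```

note the pipeline itself never "knows" which RS owns which value: the travelling rs_id is the only routing information, which is the whole point.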


> If an FPU has 1 RS, and it issued and was available for the very
> next cycle, and Dispatch had another FPU uOp ready to hand off,
> and the FU scheduler was able to determine the operand ready
> status and schedule it for the next clock,
> then a single RS could feed a pipeline on back to back clocks.

ah: no it could not.

this is critically important to understand.

an individual RS *must not* handle/accept/process/deal-with more than one result at a time. by definition its role is specifically to deal with ONE result dependency at a time and ONE result dependency alone. regardless of the length of the pipeline, a single RS *must* wait until the result that it is waiting for pops out the other end.

(this exact same rule applies to Tomasulo RSes as it does to Scoreboard RSes, because both track single dependencies).

the moment an RS tries to feed a pipeline with multiple sources it has fundamentally violated the Dependency Tracking rules that it is its responsibility to obey, and catastrophic data corruption is flat-out, categorically, 100% guaranteed as a result.

this is why if you have a 3-stage pipeline you need 3 RSes: one each for every piece of data in that pipeline, because each RS tracks *one* unique set of R/W dependencies at a time.

you can have more RSes (and it is useful to do so), but having fewer automatically and inherently means that the pipeline can never be 100% loaded.

this is really important! :)

l.

luke.l...@gmail.com
Nov 2, 2020, 8:33:32 AM
On Monday, November 2, 2020 at 2:49:15 AM UTC, EricP wrote:

> If an FPU has 1 RS, and it issued and was available for the very
> next cycle, and Dispatch had another FPU uOp ready to hand off,
> and the FU scheduler was able to determine the operand ready
> status and schedule it for the next clock,
> then a single RS could feed a pipeline on back to back clocks.

put another way: given that a single RS is associated with a single corresponding set of Dependency Matrix RaW-WaR-protecting cells, and only one corresponding single set of operand and result latches, what would happen if the RS decided to feed multiple operands into a pipeline? aside from anything: where the heck would it get them from? :)

l.

luke.l...@gmail.com
Nov 2, 2020, 5:18:10 PM
On Monday, November 2, 2020 at 1:00:28 AM UTC, luke.l...@gmail.com wrote:
> ha, hilarious. you'll like this. what scheme works really well to track multi-issue dependencies by using transitivw relationships?
>
> surpriiise: FUREGs when combined with FU-FU.

the idea here is that a VRegs-VRegs Matrix would record (whether from the same multi-issue batch or not) the chain of ISA-reg overwrites.

let's take the LD r9, r3 / ADD r9, r9 / ST r9 example

* LD r9, r3 results in VReg V0 being allocated to r9, and V1 to r3
* in the ADD the src reg is V0, however the dest r9 is new.
- V2 is allocated to this new r9 AND
- because it is a WaW on r9 an entry at (V0, V2) in the V-V matrix lights up

round the 2nd loop the same thing happens, this time lighting up (V2, V3) because the 2nd LD r9 gets allocated a new Vreg (V3) which overwrites the V2 version of r9.

note that an overwrite of ISAreg r9 by r8 cannot exist: only writes to the *same* ISA reg name create entries!

the similarity to the entire DM system covering RaW/WaR hazards is startling and tantalising.
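the allocation chain above can be sketched in plain Python (illustrative names, not LibreSOC HDL): every write to an ISA reg allocates a fresh VReg, and only a write to the *same* ISA name lights an (older, newer) cell in the V-V matrix.

```python
# Sketch of the VRegs-VRegs overwrite matrix for the LD/ADD/ST loop.
class VVMatrix:
    def __init__(self):
        self.current = {}       # ISA reg name -> live vreg number
        self.next_vreg = 0
        self.waw = set()        # lit (older_vreg, newer_vreg) cells

    def write(self, isa_reg):
        v, self.next_vreg = self.next_vreg, self.next_vreg + 1
        if isa_reg in self.current:
            # WaW on the same ISA name: record the overwrite chain
            self.waw.add((self.current[isa_reg], v))
        self.current[isa_reg] = v
        return v

    def read(self, isa_reg):
        if isa_reg not in self.current:     # first-ever use: allocate
            self.current[isa_reg] = self.next_vreg
            self.next_vreg += 1
        return self.current[isa_reg]

m = VVMatrix()
# loop iteration 1: LD r9,(r3); ADD r9,r9; ST r9
m.write("r9")          # V0 allocated to r9
m.read("r3")           # V1 allocated to r3
m.read("r9")           # ADD source: still V0
m.write("r9")          # V2: WaW on r9 -> cell (V0, V2) lights up
m.read("r9")           # ST reads V2
# loop iteration 2: second LD overwrites r9 again
m.write("r9")          # V3: cell (V2, V3) lights up
assert m.waw == {(0, 2), (2, 3)}
# r8 can never "overwrite" r9: different ISA names never light a cell
m.write("r8")
assert m.waw == {(0, 2), (2, 3)}
```

the `waw` set here plays exactly the role of the unary Matrix cells in the text: one bit per (older, newer) vreg pair, no CAM lookup required.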

l.

MitchAlsup
Nov 2, 2020, 5:26:34 PM
If you encoded the DM as a subroutine you could call it from either location!

luke.l...@gmail.com
Nov 3, 2020, 7:00:08 AM
you jest, however: because we are using a python-based HDL known as nmigen, it is not only parameteriseable, we can also use OO software techniques to extend base classes with further HDL functionality. this means unit tests on base classes can help stabilise further work in an incremental fashion.

in this case i can subclass the FU-REGs Matrix to simply add the regfile port redirection capability per cell as outlined in the jpg a couple of days ago. hence the focus on identifying the similarities because of the insane amount of time it would save.

continuing the ridiculousness, the parallels continue to emerge. if we view "FU" as "VRegs", it is even the case that the "Issue" signal (of each FU) becomes "VReg issue", and it is exactly as it sounds.

there is even a need for a PriorityPicker to select (issue) the first available VReg!

consider this: the combination of Shadow, FU-REGs and FU-FU Matrices does nothing more than track resource relationships between two entities, preserving the order in Directed Acyclic Graphs.

that there should be an identical relationship between ISARegs and VirtualRegs should not in the least be surprising.

i think therefore the best course of action would be to do a single-issue WaW system first, then look at multi-issue separately where it has to be added to the RaW/WaR DMs as well.

l.

luke.l...@gmail.com
Nov 11, 2020, 1:55:32 PM
On Tuesday, November 3, 2020 at 12:00:08 PM UTC, luke.l...@gmail.com wrote:

> consider this: the combination of Shadow, FU-REGs and FU-FU Matrices does nothing more than track resource relationships between two entities, preserving the order in Directed Acyclic Graphs.
>
> that there should be an identical relationship between ISARegs and VirtualRegs should not in the least be surprising.
>
> i think therefore the best course of action would be to do a single-issue WaW system first, then look at multi-issue separately where it has to be added to the RaW/WaR DMs as well.

pondering continuously in the background: i believe that for multi-issue it would indeed make sense to create transitive WaW relationships (WaW...aW) between registers of the same ISA name (r9) in the same batch.

coming back to the original example, LD ADD ST, all on r9, the only thing needed would be detection of the reg# being the same.
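that detection can be sketched trivially (function name is made up for illustration): within one issued batch, chained writes to the same destination reg# form the transitive WaW...aW relationships, found purely by comparing register numbers.

```python
# Sketch: same-batch WaW detection for multi-issue by reg# comparison.
def batch_waw_pairs(dests):
    """Return (i, j) index pairs where instruction j overwrites i's dest."""
    pairs = []
    last_writer = {}                 # reg number -> index of latest writer
    for j, reg in enumerate(dests):
        if reg in last_writer:
            pairs.append((last_writer[reg], j))
        last_writer[reg] = j
    return pairs

# LD r9 / ADD r9 issued as one batch (ST has no register destination)
assert batch_waw_pairs([9, 9]) == [(0, 1)]
# three same-name writes chain transitively: (0,1) then (1,2)
assert batch_waw_pairs([9, 9, 9]) == [(0, 1), (1, 2)]
# different reg numbers never pair up
assert batch_waw_pairs([9, 8]) == []
```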

l.

pec...@gmail.com
Nov 26, 2020, 9:32:28 AM
On Thursday, October 29, 2020 at 00:11:56 UTC+1, MitchAlsup wrote:
> > I haven't heard the term/acronym VVM before and wasn't able to find anything relevant by googling it, what is it?
> The Virtual Vector Method is an architectural extension to the My 66000
> instruction set that enables vectorization of almost any loop with exactly
> 2 instructions added to the instruction set.
I don't like the VVM name; a better one would be "dynamic vectorization", but unfortunately that term was used 20 years ago for instruction-window building.
Vectorized Loop eXecution (VLX) is descriptive and has better "marketing" potential.
As a scientific term - Hinted Loop Vectorization.

MitchAlsup
Nov 26, 2020, 11:01:37 AM
On Thursday, November 26, 2020 at 8:32:28 AM UTC-6, pec...@gmail.com wrote:
> On Thursday, October 29, 2020 at 00:11:56 UTC+1, MitchAlsup wrote:
> > > I haven't heard the term/acronym VVM before and wasn't able to find anything relevant by googling it, what is it?
> > The Virtual Vector Method is an architectural extension to the My 66000
> > instruction set that enables vectorization of almost any loop with exactly
> > 2 instructions added to the instruction set.
> I don't like VVM name,

Why not? It enables vector-style calculations without the overheads of vector
machines (Register file, ...). Thus, it enables vector processing virtually!