
Wide issue variable length ISA


MitchAlsup

Feb 1, 2021, 2:03:59 PM

As everyone here knows, the My 66000 ISA has variable-length instructions, consisting of an instruction specifier (first word) and operand constants
(immediates and displacements). And even though the use of these constants
is down in the 5% range, sometimes one gets several constants in "just a
few instructions".

As one contemplates a wide issue machine (6-8 instructions--wide issue in
the RISC sense, not in the MILL sense) these constants provide a bit of a
problem for certain kinds of FETCH-INSERT front end microarchitectures.
So, I have been thinking about how to solve these problems for a "while".

Some of you might remember that Shebanow and I did the Mc 88120--a
6-wide 88K ISA machine--back in 1990-92. And I led an AMD effort (K9) to do an
8-wide, 2-taken-branches-per-cycle machine in 1999-2004.

I was thinking about how to take much of that basic infrastructure and
implement a similarly-wide My 66000 machine.

Some of the characteristics are not directly obvious to those who have not
looked deeply into such machines:: a few pertinent characteristics are:
a) Issuing instructions simultaneously from non-sequential locations
b) pre-routing instructions to the necessary function unit
c) encoding "all" next paths into <effectively> the cache-tag
d) doing the above without losing the concept of precise interrupts

Both the RISC machine and the x86-64 machine used the above philosophy.

The Mc 88120 did not have constants other than the 16-bit immediates,
so the above was straightforward: just pack instructions into the "packet"
after routing (positioning) them appropriate to the required function unit.

The x86-64 did have constants and a "few" other headaches, and we
ultimately decided to sacrifice instruction "slots" in the packet (which
we called a trace) to hold those constants, and encoded fields in the
instruction slots to pick up those constants from "an easily decoded"
other instruction slot. Since x86-64 only had 32-bit constants (other
than the 64-bit displacement) this was seen as simply a decoding
effort.

For the My 66000 ISA I wanted a "better" scheme, and a couple of days
ago, my subconscious barfed up a good idea on the matter--based on
sound underlying ideas::
a) since we have precise exceptions--we CAN figure out where the
constant is in the instruction address space,
b) Since we have a scoreboard/station based instruction scheduling
mechanism, the instruction can wait for its constant just as easily
as any other operand,
c) since we will be executing packets most of the time, the ICache is
idle, so we can fetch the constants from the ICache ! and route these
as "use once" operands ! (or "Scalar Operands" when VVM is active)
d) therefore, we need allocate no space in the instruction slots for
constants!
e) all we need is an execution window "deep" enough to absorb this
additional latency.
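
A minimal C sketch of points (a)-(c): the constant is just another pending
operand in a station entry, filled from a spare ICache cycle. Field names
and widths are illustrative assumptions, not from the actual design.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     ready;       /* value present; instruction may fire          */
    bool     from_icache; /* pending fill comes from the ICache, not a FU */
    uint64_t src_addr;    /* constant's address in instruction space      */
    uint64_t value;       /* operand value once delivered                 */
} operand_t;

typedef struct {
    uint8_t   op;         /* already routed, so only a coarse op id       */
    operand_t src[2];     /* a constant waits here exactly like a register*/
                          /* operand -- no packet bits spent on its value */
} station_entry_t;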

This simplifies the design of the packet and the process of issuing
instructions into the stations. The packet consists of 6 (or 8) instruction
"slots" 3-4 bits bigger than a 32-bit word, a tag to see if we fetched the
correct packet (about 48-bits), and a couple of next addresses called
sequential and predict (about 16-bits each) and a 4-bit token that tells
the fetch unit how to form the next fetch address. That token controls
a multiplexer that takes the tag, sequential, taken, and the top of the
Call-return stack, and the Jump table, and forms the next fetch address.
This process is about 4 gates of delay (decode and multiplex) and wire
delay. So, like K9, all of this logic remains "in" the packet cache, close
to the decode inputs for the packet SRAMs........
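
A rough C rendering of that packet layout and the token-selected next-fetch
multiplexer. Sizes, and how the 16-bit fields combine with the tag, are my
guesses; the point is that the FETCH-FETCH path is decode-and-mux only,
with no adder anywhere.

#include <stdint.h>

#define SLOTS 6                  /* or 8 in the wider machine */

typedef struct {
    uint64_t tag;                /* ~48 bits: verifies we fetched the right packet */
    uint64_t slot[SLOTS];        /* 35-36-bit pre-routed instruction slots         */
    uint16_t sequential;         /* ~16-bit sequential next-address field          */
    uint16_t predict;            /* ~16-bit predicted-taken next-address field     */
    uint8_t  token;              /* 4-bit next-fetch selector                      */
} packet_t;

/* ~4 gates of decode plus a multiplexer -- no ADD in the path. */
uint64_t next_fetch(const packet_t *p, uint64_t ret_stack_top, uint64_t jump_table)
{
    switch (p->token & 0x3) {    /* only 2 of the 4 token bits shown */
    case 0:  return (p->tag & ~0xFFFFull) | p->sequential; /* sequential   */
    case 1:  return (p->tag & ~0xFFFFull) | p->predict;    /* taken        */
    case 2:  return ret_stack_top;   /* return via call-return stack       */
    default: return jump_table;      /* jump-table target                  */
    }
}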

The decode process does not have to decode the instruction as the
instruction has already been routed to the proper function unit, just
to recognize a few patterns and set up the register file reads and
constant reads and drop same in the instruction stations. So while
smaller My 66000 implementations use a 2 stage PARSE and DECODE
pipeline, the wider versions can get by with a single stage INSERT
pipeline stage.

EricP

Feb 1, 2021, 5:04:04 PM

Are you tossing the in-line constant away from the fetch buffer
and passing the address of that constant as a uOp argument,
then refetching the constant from I$ at issue time?

> This simplifies the design of the packet and the process of issuing
> instructions into the stations. The packet consists of 6 (or 8) instruction
> "slots" 3-4 bits bigger than a 32-bit word, a tag to see if we fetched the
> correct packet (about 48-bits), and a couple of next addresses called
> sequential and predict (about 16-bits each) and a 4-bit token that tells
> the fetch unit how to form the next fetch address. That token controls
> a multiplexer that takes the tag, sequential, taken, and the top of the
> Call-return stack, and the Jump table, and forms the next fetch address.
> This process is about 4 gates of delay (decode and multiplex) and wire
> delay. So, like K9, all of this logic remains "in" the packet cache, close
> to the decode inputs for the packet SRAMs........
>
> The decode process does not have to decode the instruction as the
> instruction has already been routed to the proper function unit, just
> to recognize a few patterns and set up the register file reads and
> constant reads and drop same in the instruction stations. So while
> smaller My 66000 implementations use a 2 stage PARSE and DECODE
> pipeline, the wider versions can get by with a single stage INSERT
> pipeline stage.

I don't understand the problem. Are you concerned about the
amount of uOp register space taken up to hold these constants?

In the front end each uOp has to optionally carry a single 64-bit constant,
which can be an integer or float. For PC-relative addressing
the Decoder can do the add of PC+instruction_size+offset
so the uOp doesn't need to carry around multiple constants (the PC,
the instruction size, and the offset) as separate fields.
(Decoder needs a 64-bit adder anyway so it can redirect Fetch ASAP
for branches.)
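
In C, that fold looks something like this (a sketch; names are mine):

#include <stdint.h>

typedef struct { uint8_t op; uint64_t imm; } uop_t;

/* The decoder's 64-bit adder collapses PC + instruction_size + offset
 * into one constant, so the uOp carries a single 64-bit immediate.   */
uop_t decode_pcrel(uint8_t op, uint64_t pc, uint64_t inst_size, int64_t offset)
{
    uop_t u = { op, pc + inst_size + (uint64_t)offset };
    return u;
}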

In the back end, one of the reasons I liked valued Reservation Stations
is that then the constant becomes just one of the operands passed
at Dispatch to a Reservation Station where it is held.
With valueless Reservation Stations, or the everything-in-ROB approach,
then, yes, the constant has to be held in the ROB until the uOp is issued
to function unit, and that requires extra ROB read ports at issue
and muxes to route the value to the operand buses.

I considered an alternate approach, with a circular buffer attached to
Decode where constants are temporarily stashed and the buffer index
passed as a uOp argument, which eliminates them from the central ROB.
But I decided to have valued R.S. so it wasn't necessary.
Also, thinking about this in retrospect, that circular buffer would
have to be part of checkpoint-rollback so would add more complexity.

But valued R.S. solved that problem plus gave a place to hold forwarded
operand values, which eliminates having to re-read those values from
register file or ROB at issue time and the associated ports,
and those same R.S. registers were used to store the forwarding
match tag bits so it didn't really cost anything.
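
In rough C terms (field names are illustrative, not from any real design),
a valued reservation-station entry: the same per-operand register holds
first the match tag and then the value, so constants delivered at Dispatch
cost nothing extra.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;   /* value present (a constant is valid at Dispatch) */
    uint16_t tag;     /* result-bus tag to match while waiting           */
    uint64_t value;   /* operand value: constant or forwarded result     */
} rs_operand_t;

typedef struct {
    bool         busy;
    uint8_t      op;
    rs_operand_t src[2];
} rs_entry_t;

/* Result broadcast: every waiting operand whose tag matches captures the
 * value, eliminating issue-time register-file/ROB reads and their ports. */
void rs_broadcast(rs_entry_t *rs, int n, uint16_t tag, uint64_t value)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < 2; j++)
            if (rs[i].busy && !rs[i].src[j].valid && rs[i].src[j].tag == tag) {
                rs[i].src[j].value = value;
                rs[i].src[j].valid = true;
            }
}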



MitchAlsup

Feb 1, 2021, 6:06:19 PM

I am tossing the constants from the packet--I will encode some way to
fetch the constants from the ICache, probably taking zero actual bits.
After all, we have the address of the first instruction and the address
of the last instruction (even if we have "taken" a jump "in" the packet.)
So a few small offsets to one or the other forms the addresses to fetch.
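
A small sketch of that address reconstruction (offset width and word size
are assumptions on my part):

#include <stdint.h>

/* The packet supplies the IP of its first instruction and, via SEQ/TAKEN,
 * the address just past its last one; a small synthetic per-slot offset
 * is then enough to regenerate each constant's ICache address.          */
uint64_t const_addr(uint64_t first_ip, uint64_t last_ip,
                    int anchor_last, uint8_t word_offset /* ~4 bits */)
{
    uint64_t base = anchor_last ? last_ip : first_ip;
    return base + (uint64_t)word_offset * 4;   /* 32-bit word granules */
}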

> > This simplifies the design of the packet and the process of issuing
> > instructions into the stations. The packet consists of 6 (or 8) instruction
> > "slots" 3-4 bits bigger than a 32-bit word, a tag to see if we fetched the
> > correct packet (about 48-bits), and a couple of next addresses called
> > sequential and predict (about 16-bits each) and a 4-bit token that tells
> > the fetch unit how to form the next fetch address. That token controls
> > a multiplexer that takes the tag, sequential, taken, and the top of the
> > Call-return stack, and the Jump table, and forms the next fetch address.
> > This process is about 4 gates of delay (decode and multiplex) and wire
> > delay. So, like K9, all of this logic remains "in" the packet cache, close
> > to the decode inputs for the packet SRAMs........
> >
> > The decode process does not have to decode the instruction as the
> > instruction has already been routed to the proper function unit, just
> > to recognize a few patterns and set up the register file reads and
> > constant reads and drop same in the instruction stations. So while
> > smaller My 66000 implementations use a 2 stage PARSE and DECODE
> > pipeline, the wider versions can get by with a single stage INSERT
> > pipeline stage.

> I don't understand the problem. Are you concerned about the
> amount of uOp register space taken up to hold these constants?

When I size the packet for 6 (or 8) instructions I want to be able to
put 6 (or 8) instructions in the packet--even when I have to fetch
up to 6 (or 8) constants from the ICache. That is:: don't waste bits.

>
> In the front end each uOp has to optionally carry a single 64-bit constant,

Store Immediates can have a 64-bit value to be stored, AND a 64-bit
displacement to where it needs to be stored!

> which can be an integer or float. For PC-relative addressing
> the Decoder can do the add of PC+instruction_size+offset
> so the uOp doesn't need to carry around multiple constants (the PC,
> the instruction size, and the offset) as separate fields.
> (Decoder needs a 64-bit adder anyway so it can redirect Fetch ASAP
> for branches.)

In the design mentioned, there is no ADD in the FETCH-FETCH path !
In fact you don't know any subsequent fetch address until the current
packet shows up.

>
> In the back end, one of the reasons I liked valued Reservation Stations
> is that then the constant becomes just one of the operands passed
> at Dispatch to a Reservation Station where it is held.

Yes, this is where I started--with valued reservation stations.

> With valueless Reservation Stations, or the everything-in-ROB approach,
> then, yes, the constant has to be held in the ROB until the uOp is issued
> to function unit, and that requires extra ROB read ports at issue
> and muxes to route the value to the operand buses.

The constant has a peculiar property--it is known to be used-once !!
Thus it could be put somewhere other than ROB.
{Completely sidestepping the fact that the design will not have an ROB}

>
> I considered an alternate approach, with a circular buffer attached to
> Decode where constants are temporarily stashed and the buffer index
> passed as a uOp argument, which eliminates them from the central ROB.
> But I decided to have valued R.S. so it wasn't necessary.
> Also, thinking about this in retrospect, that circular buffer would
> have to be part of checkpoint-rollback so would add more complexity.

There are lots of good microarchitectural choices in this less than optimally
explored design space; and a few poor choices, too.

>
> But valued R.S. solved that problem plus gave a place to hold forwarded
> operand values, which eliminates having to re-read those values from
> register file or ROB at issue time and the associated ports,
> and those same R.S. registers were used to store the forwarding
> match tag bits so it didn't really cost anything.

Yes, yes.

Can I ask:: Did you manage to put the stations in the pipeline without giving
them a clock?

In the Mc 88120, we could issue instructions into the reservation station.
When the station was not already issuing an instruction, we allowed the issuing
instruction to proceed directly into execution (assuming it had its operands).
So, in effect, the RSs did not have a stage in the pipe, but were simply feedback
paths much like forwarding (but with storage).

Stephen Fuld

Feb 2, 2021, 11:27:17 AM

I presume that would all work, but I have a question. Why the icache
versus instruction fetch buffers? While you might need more buffers,
and a mechanism to keep them from being overwritten until the constants
have been used, they are closer, so should require less time and power
to access.



--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Timothy McCaffrey

Feb 2, 2021, 1:14:10 PM

On Monday, February 1, 2021 at 2:03:59 PM UTC-5, MitchAlsup wrote:


> For the My 66000 ISA I wanted a "better" scheme, and a couple of days
> ago, my subconscious barfed up a good idea on the matter--based on
> sound underlying ideas::
> a) since we have precise exceptions--we CAN figure out where the
> constant is in the instruction address space,
> b) Since we have a scoreboard/station based instruction scheduling
> mechanism, the instruction can wait for its constant just as easily
> as any other operand,
> c) since we will be executing packets most of the time, the ICache is
> idle, so we can fetch the constants from the ICache ! and route these
> as "use once" operands ! (or "Scalar Operands" when VVM is active)
> d) therefore, we need allocate no space in the instruction slots for
> constants!
> e) all we need is an execution window "deep" enough to absorb this
> additional latency.
>

OTOH, it seems like if you are going to fetch the constant from iCache anyway, you could just
encode the constant as a short offset from the instruction pointer. Thereby, the shorter instruction
gives you more fetch bandwidth. Of course, this makes code generation annoyingly complicated...

- Tim

EricP

Feb 2, 2021, 1:15:58 PM

Ok, I just wanted to make sure this wasn't me
having some kind of cerebral hemorrhage.

>>> This simplifies the design of the packet and the process of issuing
>>> instructions into the stations. The packet consists of 6 (or 8) instruction
>>> "slots" 3-4 bits bigger than a 32-bit word, a tag to see if we fetched the
>>> correct packet (about 48-bits), and a couple of next addresses called
>>> sequential and predict (about 16-bits each) and a 4-bit token that tells
>>> the fetch unit how to form the next fetch address. That token controls
>>> a multiplexer that takes the tag, sequential, taken, and the top of the
>>> Call-return stack, and the Jump table, and forms the next fetch address.
>>> This process is about 4 gates of delay (decode and multiplex) and wire
>>> delay. So, like K9, all of this logic remains "in" the packet cache, close
>>> to the decode inputs for the packet SRAMs........
>>>
>>> The decode process does not have to decode the instruction as the
>>> instruction has already been routed to the proper function unit, just
>>> to recognize a few patterns and set up the register file reads and
>>> constant reads and drop same in the instruction stations. So while
>>> smaller My 66000 implementations use a 2 stage PARSE and DECODE
>>> pipeline, the wider versions can get by with a single stage INSERT
>>> pipeline stage.
>
>> I don't understand the problem. Are you concerned about the
>> amount of uOp register space taken up to hold these constants?
>
> When I size the packet for 6 (or 8) instructions I want to be able to
> put 6 (or 8) instructions in the packet--even when I have to fetch
> up to 6 (or 8) constants from the ICache. That is:: don't waste bits.

Ok, I see below that you have some instructions like store-immediate
with two 64-bit constants. I take it the problem is that you don't
want to size the uOp registers with the worst case 128 bits when
only relatively few instructions would use them.

>> In the front end each uOp has to optionally carry a single 64-bit constant,
>
> Store Immediates can have a 64-bit value to be stored, AND a 64-bit
> displacement to where it needs to be stored!

Yeah, I don't have store-immediate with its possible
constant pair of offset and immediate value.

>> which can be an integer or float. For PC-relative addressing
>> the Decoder can do the add of PC+instruction_size+offset
>> so the uOp doesn't need to carry around multiple constants (the PC,
>> the instruction size, and the offset) as separate fields.
>> (Decoder needs a 64-bit adder anyway so it can redirect Fetch ASAP
>> for branches.)
>
> In the design mentioned, there is no ADD in the FETCH-FETCH path !
> In fact you don't know any subsequent fetch address until the current
> packet shows up.

?????
Then how does unconditional branch/call work?
If you wait until the Branch Control Unit actually processes
the branch/call instruction, then that is N clocks later
and injects a bubble in the front end.
On 8-wide issue, N clocks is an N*8 instruction bubble.

>> In the back end, one of the reasons I liked valued Reservation Stations
>> is that then the constant becomes just one of the operands passed
>> at Dispatch to a Reservation Station where it is held.
>
> Yes, this is where I started--with valued reservation stations.
>
>> With valueless Reservation Stations, or the everything-in-ROB approach,
>> then, yes, the constant has to be held in the ROB until the uOp is issued
>> to function unit, and that requires extra ROB read ports at issue
>> and muxes to route the value to the operand buses.
>
> The constant has a peculiar property--it is known to be used-once !!
> Thus it could be put somewhere other than ROB.
> {Completely sidestepping the fact that the design will not have an ROB}

Yeah, and ROB-centralized designs put everything in one place,
then complain about having so many ports on the ROB file.
Doctor, doctor, it hurts when I do this....

>> I considered an alternate approach, with a circular buffer attached to
>> Decode where constants are temporarily stashed and the buffer index
>> passed as a uOp argument, which eliminates them from the central ROB.
>> But I decided to have valued R.S. so it wasn't necessary.
>> Also, thinking about this in retrospect, that circular buffer would
>> have to be part of checkpoint-rollback so would add more complexity.
>
> There are lots of good microarchitectural choices in this less than optimally
> explored design space; and a few poor choices, too.
>
>> But valued R.S. solved that problem plus gave a place to hold forwarded
>> operand values, which eliminates having to re-read those values from
>> register file or ROB at issue time and the associated ports,
>> and those same R.S. registers were used to store the forwarding
>> match tag bits so it didn't really cost anything.
>
> Yes, yes.
>
> Can I ask:: Did you manage to put the stations in the pipeline without giving
> them a clock?

No, the RS are clocked, though I was still hoping to use latches
rather than FF. The critical path is forwarded values for
back-to-back instructions. Latches allows a forwarded value to
flow through from the result bus, through an RS, directly into the
calculation unit, then the RS latch closes holding the value.

The timing of this RS sample-hold action would be dependent on the
propagation delay on the result bus, I'm guessing 1/4 cycle delay
after the result is broadcast.

> In the Mc 88120, we could issue instructions into the reservation station.
> When the station was not already issuing an instruction, we allowed the issuing
> instruction to proceed directly into execution (assuming it had its operands).
> So, in effect, the RSs did not have a stage in the pipe, but were simply feedback
> paths much like forwarding (but with storage).

This sounds the same as what I describe above.
Since this back-to-back forwarding is the critical path
I also came up with a neat logic hack idea if the propagation
delay through the RS pathway was too slow.

(This really requires a white board but I'll give it a shot anyway.)
Below all takes place within 1 clock cycle.
It starts at the rising edge of the clock,
and the calculation result is saved in a FF at the next rising edge
for 1 cycle operations like ALU.

If the path along the result bus, through the routing muxes,
through the RS, through more routing muxes, into the Calculation Unit (CU)
was too long, there could be a "fast path" and "slow path"
for operand routing.

The problem is how to switch the CU operand input from the
fast path to slow path without glitching.
Ordinarily a mux selects either the left or right input,
and it can glitch the output when switching between them.

An AND-OR mux circuit would route a forwarded operand directly
into the CU operand input so it can begin its calculation immediately.
The result would also route through the slow path, the RS latches,
the RS output bus, etc.

Later the RS latch closes to hold the forwarded value and puts it
onto the RS output bus leading to the CU.

Later the "fast path" mux enables the second path
from the RS latch while keeping fast result enabled.
Since the values on the fast and slow paths are the same and
it goes through an AND-OR mux, it OR's to values that are the same
so there shouldn't be a glitch while it switches paths.

Later the fast path input is disabled and the
RS holds the input operand to the CU.

At the rising edge of the next clock the CU result is clocked into
a result FF, and possibly driven out onto the tri-state result bus
for storage in the register file or forwarding.
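
Modeled as combinational C (a sketch of the logic, not a timing-accurate
model), the AND-OR mux and its handoff sequence look like:

#include <stdint.h>

/* Each input is gated by its own enable and the results are ORed.
 * While both enables are on, fast and slow carry the same value,
 * so the output cannot glitch during the switch.                  */
uint64_t and_or_mux(uint64_t fast, int fast_en, uint64_t slow, int slow_en)
{
    uint64_t fm = (uint64_t)0 - (uint64_t)(fast_en & 1); /* all ones if enabled */
    uint64_t sm = (uint64_t)0 - (uint64_t)(slow_en & 1);
    return (fast & fm) | (slow & sm);
}

/* Handoff within one cycle:
 *   1. fast_en=1, slow_en=0 : result bus drives the CU directly
 *   2. RS latch closes; slow path now carries the identical value
 *   3. fast_en=1, slow_en=1 : identical values OR together, no glitch
 *   4. fast_en=0, slow_en=1 : the RS alone holds the CU operand       */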


MitchAlsup

Feb 2, 2021, 1:46:55 PM

On Tuesday, February 2, 2021 at 12:14:10 PM UTC-6, timca...@aol.com wrote:
> On Monday, February 1, 2021 at 2:03:59 PM UTC-5, MitchAlsup wrote:
>
>
> > For the My 66000 ISA I wanted a "better" scheme, and a couple of days
> > ago, my subconscious barfed up a good idea on the matter--based on
> > sound underlying ideas::
> > a) since we have precise exceptions--we CAN figure out where the
> > constant is in the instruction address space,
> > b) Since we have a scoreboard/station based instruction scheduling
> > mechanism, the instruction can wait for its constant just as easily
> > as any other operand,
> > c) since we will be executing packets most of the time, the ICache is
> > idle, so we can fetch the constants from the ICache ! and route these
> > as "use once" operands ! (or "Scalar Operands" when VVM is active)
> > d) therefore, we need allocate no space in the instruction slots for
> > constants!
> > e) all we need is an execution window "deep" enough to absorb this
> > additional latency.
> >
> OTOH, it seems like if you are going to fetch the constant from iCache anyway, you could just
> encode the constant as a short offset from the instruction pointer.

I do this in the packet, not in the ISA. And the constant accesses are purely synthetic
(invented by the issue stage of the pipeline.)

> Thereby, the shorter instruction
> gives you more fetch bandwidth. Of course, this makes code generation annoyingly complicated...

A reason NOT to do it that way.
>
> - Tim

MitchAlsup

Feb 2, 2021, 1:57:07 PM

Instructions leading up to the BR/CALL and instructions at the target
of the BR/CALL can both be in a single packet !! For this scenario,
both the SEQ and TAKEN fields in the "tag" are used to form the target/call
address. The formed address is that of the first instruction that did not make
it into the packet.

> If you wait until the Branch Control Unit actually processes
> the branch/call instruction, then that is N clocks later
> and injects a bubble in the front end.

You have to do this without anything more than a multiplexer if you want
to get back-to-back accesses.

> On 8-wide issue, N clocks is an N*8 instruction bubble.

Which is why N has to be 0.

See, I rather thought about it the other way:: there is a slow path
and a fast path. The slow path knows if it is being used (has an
instruction to issue), and if not, it enables the fast path.

> Ordinarily a mux selects either the left or right input,
> and it can glitch the output when switching between them.
>
> An AND-OR mux circuit that would route a forwarded operand directly
> into the CU operand so it can begin its calculation immediately.
> The result would also route through the slow path, the RS latches,
> the RS output bus, etc.

Yes, yes.
>
> Later the RS latch closes to hold the forwarded value and puts it
> onto the RS output bus leading to the CU.
>
> Later the "fast path" mux enables the second path
> from the RS latch while keeping fast result enabled.
> Since the values on the fast and slow paths are the same and
> it goes through an AND-OR mux, it OR's to values that are the same
> so there shouldn't be a glitch while it switches paths.

Clever.....but sounds pretty much like what I tried to explain....

Paul A. Clayton

Feb 4, 2021, 12:16:56 PM

On Monday, February 1, 2021 at 2:03:59 PM UTC-5, MitchAlsup wrote:
> As everyone here knows, the My 66000 ISA has variable-length instructions, consisting of an instruction specifier (first word) and operand constants
> (immediates and displacements). And even though the use of these constants
> is down in the 5% range, sometimes one gets several constants in "just a
> few instructions".
[snip background information]
> For the My 66000 ISA I wanted a "better" scheme, and a couple of days
> ago, my subconscious barfed up a good idea on the matter--based on
> sound underlying ideas::
> a) since we have precise exceptions--we CAN figure out where the
> constant is in the instruction address space,
> b) Since we have a scoreboard/station based instruction scheduling
> mechanism, the instruction can wait for its constant just as easily
> as any other operand,
> c) since we will be executing packets most of the time, the ICache is
> idle, so we can fetch the constants from the ICache ! and route these
> as "use once" operands ! (or "Scalar Operands" when VVM is active)
> d) therefore, we need allocate no space in the instruction slots for
> constants!
> e) all we need is an execution window "deep" enough to absorb this
> additional latency.

I do not like this because it reads from a large general
structure when a smaller (likely more specialized) structure
might be used. (This dislike is not informed by a deeply trained
intuition — irregular sleep and other factors also hinder mental
acuity — much less an actual analysis of hardware tradeoffs but
does fit the principle that an optimization should give greater
weight to entangled factors even if they are uncommon.)

(Yes, this design exploits the otherwise idle Icache, more
efficiently packs instructions in the packet cache both for
storage and for initiating execution, and exploits some
attributes of constant values. Because larger constants are
uncommon, the energy cost of accessing Icache occasionally would
still be much lower than always accessing Icache in parallel with
a packet cache, but such still bothers me.)

For small/modest-sized loops, including non-vectorized loops,
constants will have reuse with significant temporal locality.
It might also be noted that loops have internally constant
values, which hints that a mechanism might exploit this factor.

Even outside of loops, there are "pseudo-constants" which might
benefit from different handling than other variables. (Values
incremented by instruction constants — or even pseudo-constants —
may also present an optimization/specialization opportunity.)
Since a packet cache is caching decode and other information on
the presumption of reuse before replacement, one might presume
constant reuse.

It may be that insufficient additional uses of a specialized
storage structure could be provided to justify such and worst-
case sizing may be problematic. (I suspect sacrificing
performance might be acceptable for pathological cases, but there
are likely reasonable use cases that would be difficult to
handle. Uncommon cases with high utilization of a facility are
difficult optimization problems.) Even with storage provided for
constants, such might "overflow" to the Icache, increasing
access energy rather than reducing effective packet-cache
capacity or issue rate.

Finding other uses of such storage may be the most significant
difficulty. While it is easy to think of uses for storage,
avoiding utilization conflicts and generalization overheads makes
such more challenging. E.g., using such for load-value prediction
(or micro-cache) and store buffer values might introduce little
overhead in the area of the specialized storage but introduce
complexity/delay in data source selection and face excess
pressure when large immediate values are common, using such as an
overflow for the packet-cache would seem to complicate the
access path (though this use might be exploited to provide
wider than usual execution when certain structural hazards are
not present — that seems flaky), using such for speculative
results would have low utilization conflict issues (coarser
grained checkpointing could be used when constants are common)
and might work with high banking (a constant cache would have to
have wide access) but supporting frequent writes might conflict
with optimization for constant storage.

MitchAlsup

Feb 4, 2021, 12:49:54 PM

At the time one is fetching the "packet" one only has the address
of the starting instruction so one can only access constants in
that and subsequent cache lines.

But while building the packet, unconditional branches (and calls)
are pre-executed and do not actually remain in the packet itself.

It is only after the packet arrives that we can form all the addresses
from which constants may be required.

One of the things we did on the 88120 was to access the caches
twice per cycle; on the LD/ST side the early access was a read,
the later access a write. On the packet cache side, the first access
was a read of the next-predicted packet, and the second one was
a read of the non-predicted packet. In this way, should the prediction
fail, we already had the instructions ready to be inserted into
execution at the time we knew that the branch needed recovery.
Thus, we could execute a branch, detect it failed, and have instructions
at the failure point ready to be inserted into execution in a single
clock. {This helped a lot in SPEC89 XLISP and GCC.}

>
> For small/modest-sized loops, including non-vectorized loops,
> constants will have reuse of with significant temporal locality.
> It might also be noted that loops have internally constant
> values, which hints that a mechanism might exploit this factor.

In the case of VVM loops, the HW can recognize that the constants
are used over-and-over and KNOW that the constant is simply a
scalar-operand in a vectorized loop. The ICache will be idle in these
situations.

>
> Even outside of loops, there are "pseudo-constants" which might
> benefit from different handling than other variables. (Values
> incremented by instruction constants — or even pseudo-constants —
> may also present an optimization/specialization opportunity.)
> Since a packet cache is caching decode and other information on
> the presumption of reuse before replacement, one might presume
> constant reuse.

Remember, large-constants are 5%-ile instruction attachments.

MitchAlsup

Feb 21, 2021, 9:32:26 PM

On Thursday, February 4, 2021 at 11:49:54 AM UTC-6, MitchAlsup wrote:

I went through the vectorized compilation of the Livermore Loops applying packetization
rules for a prototype packet cache, and while My 66000 seems to packetize "better"
than Mc 88120, there are a few differences that stand out::

a) the FMAC instruction has significantly reduced the number of floating point
instructions in the mix, and often comes in a series {FMUL, FMAC[k]}. Code scheduling
does not seem to help a "whole lot" since there are mostly LDs leading in, and STs
leading out.

b) There are a lot of opportunities to compress two (2) CMP-BCs into a single packet;
a good 30% of these are followed by an unconditional branch. Since the unconditional
branch only alters the sequential field in a packet, if I can find a way to handle these
two (2) CMP-BCs in a single packet, packet density improves a lot.

c) There are a lot of sequences where a bunch of STs are performed and then a bunch
of LDs; sometimes the LDs and STs are intermixed. There are other sequences where a
series of MOVs shuffles values in the registers--typically setting up loops for another
iteration.

(a, b, and c) are indicative that one needs a better way to avoid packet instruction
saturation on a function-unit-by-function-unit basis.

On the other hand, there is little evidence that one needs 12 inbound registers per packet,
and only modest evidence that one needs 6 result registers per packet. Thus there might
be an opportunity to move this data into some common "renamer pool" of registers and
perhaps prune some bits from the packet overall.

So, I counted the instructions (1416) and the number of packets (306), and "eyeballing
it" I got a packet density of 4.6. This is to be compared with the packet density we got
on the Mc 88120, which was around 4.0. So some forward progress, but still a significant
way to go.

JohnG

May 5, 2021, 3:57:05 AM

On Monday, February 1, 2021 at 11:03:59 AM UTC-8, MitchAlsup wrote:
> c) since we will be executing packets most of the time, the ICache is
> idle, so we can fetch the constants from the ICache ! and route these
> as "use once" operands ! (or "Scalar Operands" when VVM is active)

Maybe a dumb question, but is it possible for the ICache entry to be evicted between when you did the fetch/decode and when you go looking for the immediate?

-JohnG

Quadibloc

May 5, 2021, 9:33:11 AM

On Monday, February 1, 2021 at 12:03:59 PM UTC-7, MitchAlsup wrote:

> For the My 66000 ISA I wanted a "better" scheme, and a couple of days
> ago, my subconscious barfed up a good idea on the matter--based on
> sound underlying ideas::
> a) since we have precise exceptions--we CAN figure out where the
> constant is in the instruction address space,
> b) Since we have a scoreboard/station based instruction scheduling
> mechanism, the instruction can wait for its constant just as easily
> as any other operand,
> c) since we will be executing packets most of the time, the ICache is
> idle, so we can fetch the constants from the ICache ! and route these
> as "use once" operands ! (or "Scalar Operands" when VVM is active)
> d) therefore, we need allocate no space in the instruction slots for
> constants!
> e) all we need is an execution window "deep" enough to absorb this
> additional latency.

At first I couldn't make heads or tails out of this.

But upon reflection, what I *think* you're talking about is this:

Immediate instructions 'pop' constants from a constant stack, and
the instruction stream also contains constants, which push themselves
(i.e. there's an instruction opcode that means 'push constant immediate')
with the possible additional proviso that an immediate instruction can pop
a constant 'on credit'; i.e., wait until the corresponding push constant
instruction happens, which does not need to precede the immediate
instruction in the instruction stream.

I would never have thought of this, or at least seriously considered it, for
the following reason: the 'push constant' opcode would have to be
wastefully long, given that constants are of data types that fit well into
the memory... and instructions also fit well into the _same_ memory,
with the same alignment and chunks.

So I prefer schemes which instead make use of a small number of
overhead bits in the immediate instructions. But, even though it covers
possibly multiple constants, I still also need at least one 16-bit block
header in the schemes I've devised, which still leaves me unsatisfied.

The only possible alternative I could come up with, that I've currently rejected
as preposterous, is: devise a secondary instruction set where the instructions
are 29 bits long instead of 32 bits, and have the first instruction in each block
come from this set, thus allowing three bits in the block to say how many
instruction slots contain constants.

That minimizes overhead to keep instruction density high... but the price in
complexity is excessive.

Not that _one_ possibility doesn't still remain with me:

Organize the instruction set in such a way that the 29-bit instructions are a
natural subset of the 32-bit instructions, so that this scheme _can_ be used
without defining two instruction sets.

That is a natural extension of a principle I've already tried: have the first bit
of a 32-bit instruction slot indicate if it contains two 16-bit instructions or one
32-bit instruction, but use it to indicate prefixes instead of pairs of 16-bit
instructions in the first slot.

John Savard

Thomas Koenig

May 5, 2021, 11:33:29 AM

Quadibloc <jsa...@ecn.ab.ca> schrieb:

> That is a natural extension of a principle I've already tried: have the first bit
> of a 32-bit instruction slot indicate if it contains two 16-bit instructions or one
> 32-bit instruction, but use it to indicate prefixes instead of pairs of 16-bit
> instructions in the first slot.

I much prefer another idea, whose basic frame was developed here
a few weeks ago (and which I have extended a bit).

The instructions come in 64-bit bundles. They are independent of
each other, they just happen to share the same word.

Encoding is in the first bits:

Leading 0 means three instructions of 21 bits each. This is for
all the ra = ra op b instructions (which are rather frequent even
in three-operand machines), plus for stuff like load/store with
reasonably large offsets from the stack pointer (or TOC) which
occur frequently. Frequency analysis of existing RISC programs
could give a good idea which instructions should go in there.

Leading 100 and 101 means a 40 bit instruction and a 21 bit
instruction. The 40 bit instructions should be enough for all
"normal" stuff including loading 32-bit constants, rather large
branches depending on bits set in registers, and maybe a few
four-operand instructions, like FMA or

ra = rb - rc, plus set all comparison flags in rd

or a direct

(ra, rb) = (mulh(rc, rd), mull(rc,rd))

Leading 110 means a 61-bit instruction, which could do vector
stuff with very many registers or other strange things.

Finally, 111 should be reserved for future expansion.
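
A hypothetical C decoder for this bundle format (the exact bit positions
are my guesses where the description above leaves them open):

#include <stdint.h>

typedef struct { int count; uint64_t inst[3]; } bundle_t;

bundle_t decode_bundle(uint64_t b)
{
    bundle_t r = { 0, {0, 0, 0} };
    if ((b >> 63) == 0) {                          /* 0: three 21-bit slots */
        r.count = 3;
        r.inst[0] = (b >> 42) & 0x1FFFFF;
        r.inst[1] = (b >> 21) & 0x1FFFFF;
        r.inst[2] =  b        & 0x1FFFFF;
    } else if ((b >> 61) == 4 || (b >> 61) == 5) { /* 100/101: 40 + 21      */
        r.count = 2;
        r.inst[0] = (b >> 21) & 0xFFFFFFFFFFull;   /* 40-bit instruction    */
        r.inst[1] =  b        & 0x1FFFFF;          /* 21-bit instruction    */
    } else if ((b >> 61) == 6) {                   /* 110: one 61-bit inst  */
        r.count = 1;
        r.inst[0] = b & 0x1FFFFFFFFFFFFFFFull;
    }                                              /* 111: reserved         */
    return r;
}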

There is a certain danger that some 21-bit slots will not be filled
with reasonable instructions.

Jump targets in a bundle could be either the address of the bundle
plus the number of the instruction, or the byte address rounded
up or down to the next 8 bytes. Either way, if there is no legal
instruction there, an exception will be raised.

I think there is a good chance that the code density could be a
bit higher than a 32/64 bit encoding scheme.

MitchAlsup

May 5, 2021, 1:02:11 PM

Yes, this simply causes the instruction waiting for that constant to wait longer.
<
> -JohnG

MitchAlsup

May 5, 2021, 1:11:17 PM

On Wednesday, May 5, 2021 at 8:33:11 AM UTC-5, Quadibloc wrote:
> On Monday, February 1, 2021 at 12:03:59 PM UTC-7, MitchAlsup wrote:
>
> > For the My 66000 ISA I wanted a "better" scheme, and a couple of days
> > ago, my subconscious barfed up a good idea on the matter--based on
> > sound underlying ideas::
> > a) since we have precise exceptions--we CAN figure out where the
> > constant is in the instruction address space,
> > b) Since we have a scoreboard/station based instruction scheduling
> > mechanism, the instruction can wait for its constant just as easily
> > as any other operand,
> > c) since we will be executing packets most of the time, the ICache is
> > idle, so we can fetch the constants from the ICache ! and route these
> > as "use once" operands ! (or "Scalar Operands" when VVM is active)
> > d) therefore, we need allocate no space in the instruction slots for
> > constants!
> > e) all we need is an execution window "deep" enough to absorb this
> > additional latency.
> At first I couldn't make heads or tails out of this.
>
> But upon reflection, what I *think* you're talking about is this:
>
> Immediate instructions 'pop' constants from a constant stack, and
> the instruction stream also contains constants, which push themselves
> (i.e. there's an instruction opcode that means 'push constant immediate')
<
There is no such OpCode. The constants are fetched from execution memory.
Executing instructions from that memory have been placed into what I call a
packet. A packet contains places where 6 (or 8) instructions can be placed.
These instructions are 35 bits long--the register specifiers have expanded
to 6 bits to allow for intra-packet forwarding and single-use encodings.
<
> with the possible additional proviso that an immediate instruction can pop
> a constant 'on credit'; i.e., wait until the corresponding push constant
> instruction happens, which does not need to precede the immediate
> instruction in the instruction stream.
>
> I would never have thought of this, or at least seriously considered it, for
> the following reason: the 'push constant' opcode would have to be
> wastefully long, given that constants are of data types that fit well into
> the memory... and instructions also fit well into the _same_ memory,
> with the same alignment and chunks.

There is no push constant; all we need is something like a 4-bit "address"
because we have the IP of the first instruction and 99% of the time those
6 (or 8) instructions fit in less than 16 words. And note: the common 16-bit
immediates need no such accessing of the ICache.
>
> So I prefer schemes which instead make use of a small number of
> overhead bits in the immediate instructions. But, even though it covers
> possibly multiple constants, I still also need at least one 16-bit block
> header in the schemes I've devised, which still leaves me unsatisfied.
>
> The only possible alternative I could come up with, that I've currently rejected
> as preposterous, is: devise a secondary instruction set where the instructions
> are 29 bits long instead of 32 bits, and have the first instruction in each block
> come from this set, thus allowing three bits in the block to say how many
> instruction slots contain constants.
<
If you think about it, what I am trying to do is essentially what you are doing;
I am just trying to do it dynamically as I build packets. Doing it dynamically,
I can get those 3 bits (5 in my case) without damaging the natural encoding
of the instructions.

EricP

May 6, 2021, 11:00:54 AM

Some interesting tidbits...

The AMD patent on the operation cache, 2016
https://patents.google.com/patent/US20200225956A1/

refers to some microarchitecture details, presumably for Zen3,
which, IIUC, deal with IP and variable format instructions,
and how they carry around these multiple 64-bit values internally.

On x64 there can be multiple uOps for each macro-op and therefore each IP.
Each macro-op can optionally have a displacement and/or immediate constant.

If an instruction has both displacement and immediate and decodes to
multiple uOps, those constants might be used by one or more different uOps.
But we don't want all uOps to carry around 64-bit quantities everywhere.

It looks like the uArch has the decoder take the parsed
instruction and IP and split it out into 3 piles:
- the IP goes into an IP circular buffer
- the 1/2/4 byte displacement and/or immediate, if present,
into a constants circular buffer
- one or more uOps go into the uOp buffer with the indexes into
the IP and constants buffers.

Essentially they "second normal form" it.
Then presumably each circular buffer is sized according to usage stats.

                IP
                v
variable_instruction [displacement] [immediate]
                v
             decode
                v
   -------------------------
   v            v            v
  IP <--idx--- uOp ---idx--> const
 queue        queue          queue

IIUC later when they decide to build a uOp cache entry,
they pack that back together in what looks like a heads-n-tails
bundle holding the multiple uOps and optional constants.
The IP can be tossed for the uOp cache because we can recreate it
from the IP used to fetch from the uOp cache.
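
In C, the split might look like this (names and buffer sizes are
illustrative, not from the patent):

#include <stdint.h>

typedef struct {
    uint16_t opcode;
    uint8_t  ip_idx;     /* index into the IP circular buffer           */
    uint8_t  disp_idx;   /* index into the constants buffer (or "none") */
    uint8_t  imm_idx;    /* likewise for the immediate                  */
} uop_t;

typedef struct {
    uint64_t ip[64];     /* IP circular buffer: one entry per macro-op  */
    uint64_t consts[64]; /* shared displacement/immediate buffer        */
    uop_t    uops[128];  /* uOps carry small indexes, not 64-bit values */
} frontend_queues_t;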


