Ok, I just wanted to make sure this wasn't me
having some kind of cerebral hemorrhage.
>>> This simplifies the design of the packet and the process of issuing
>>> instructions into the stations. The packet consists of 6 (or 8) instruction
>>> "slots" 3-4 bits bigger than a 32-bit word, a tag to see if we fetched the
>>> correct packet (about 48-bits), and a couple of next addresses called
>>> sequential and predict (about 16-bits each) and a 4-bit token that tells
>>> the fetch unit how to form the next fetch address. That token controls
>>> a multiplexer that takes the tag, sequential, taken, and the top of the
>>> Call-return stack, and the Jump table, and forms the next fetch address.
>>> This process is about 4 gates of delay (decode and multiplex) and wire
>>> delay. So, like K9, all of this logic remains "in" the packet cache, close
>>> to the decode inputs for the packet SRAMs........
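Just to check my understanding, here's a rough C sketch of that next-fetch select; the field widths, names, and token encoding below are my guesses, not the actual design:

```c
#include <stdint.h>
#include <assert.h>

/* Guessed encoding of (a subset of) the 4-bit next-fetch token;
 * the real token presumably packs more cases. */
enum NextFetch {
    NF_SEQUENTIAL = 0,   /* next-in-line packet       */
    NF_PREDICT    = 1,   /* predicted-taken target    */
    NF_RETURN     = 2,   /* top of call-return stack  */
    NF_JUMPTABLE  = 3    /* jump-table entry          */
};

/* ~4 gates in hardware: decode the token, select one source.
 * I'm guessing the 16-bit sequential/predict fields supply the
 * low address bits under the high bits recovered from the tag. */
uint64_t next_fetch_addr(enum NextFetch tok,
                         uint64_t tag_hi,     /* high bits from packet tag */
                         uint16_t sequential,
                         uint16_t predict,
                         uint64_t ret_top,
                         uint64_t jump_entry)
{
    switch (tok) {
    case NF_SEQUENTIAL: return tag_hi | sequential;
    case NF_PREDICT:    return tag_hi | predict;
    case NF_RETURN:     return ret_top;
    case NF_JUMPTABLE:  return jump_entry;
    }
    return 0;
}
```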
>>>
>>> The decode process does not have to decode the instruction as the
>>> instruction has already been routed to the proper function unit, just
>>> to recognize a few patterns and set up the register file reads and
>>> constant reads and drop same in the instruction stations. So while
>>> smaller My 66000 implementations use a 2 stage PARSE and DECODE
>>> pipeline, the wider versions can get by with a single stage INSERT
>>> pipeline stage.
>
>> I don't understand the problem. Are you concerned about the
>> amount of uOp register space taken up to hold these constant?
>
> When I size the packet for 6 (or 8) instructions I want to be able to
> put 6 (or 8) instructions in the packet--even when I have to fetch
> up to 6 (or 8) constants from the ICache. That is:: don't waste bits.
Ok, I see below that you have some instructions like store-immediate
with two 64-bit constants. I take it the problem is that you don't
want to size the uOp registers for the worst-case 128 bits when
only relatively few instructions would use them.
>> In the front end each uOp has to optionally carry a single 64-bit constant,
>
> Store Immediates can have a 64-bit value to be stored, AND a 64-bit
> displacement to where it needs to be stored!
Yeah, I don't have store-immediate with its possible
constant pair of offset and immediate value.
>> which can be an integer or float. For PC-relative addressing
>> the Decoder can do the add of PC+instruction_size+offset
>> so the uOp doesn't need to carry around multiple constants,
>> the PC, the instruction size, and the offset, as separate fields.
>> (Decoder needs a 64-bit adder anyway so it can redirect Fetch ASAP
>> for branches.)
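Concretely, the fold I mean is just (C sketch, my naming):

```c
#include <stdint.h>
#include <assert.h>

/* Decoder-side fold: instead of shipping {PC, insn_size, offset}
 * in the uOp as three fields, compute the PC-relative target once
 * at decode and carry only the single 64-bit constant slot the
 * uOp already has. */
uint64_t fold_pcrel(uint64_t pc, unsigned insn_size, int64_t offset)
{
    return pc + insn_size + (uint64_t)offset;
}
```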
>
> In the design mentioned, there is no ADD in the FETCH-FETCH path !
> In fact you don't know any subsequent fetch address until the current
> packet shows up.
?????
Then how does unconditional branch/call work?
If you wait until the Branch Control Unit actually processes
the branch/call instruction, then that is N clocks later
and injects a bubble in the front end.
On 8-wide issue, N clocks is an N*8 instruction bubble.
>> In the back end, one of the reasons I liked valued Reservation Stations
>> is that then the constant becomes just one of the operands passed
>> at Dispatch to a Reservation Station where it is held.
>
> Yes, this is where I started--with valued reservation stations.
>
>> With valueless Reservation Stations, or the everything-in-ROB approach,
>> then, yes, the constant has to be held in the ROB until the uOp is issued
>> to a function unit, and that requires extra ROB read ports at issue
>> and muxes to route the value to the operand buses.
>
> The constant has a peculiar property--it is known to be used-once !!
> Thus it could be put somewhere other than ROB.
> {Completely sidestepping the fact that the design will not have an ROB}
Yeah, and a ROB-centralized design puts everything in one place,
then complains about having so many ports on the ROB file.
Doctor, doctor, it hurts when I do this....
>> I considered an alternate approach, with a circular buffer attached to
>> Decode where constants are temporarily stashed and the buffer index
>> passed as a uOp argument, which eliminates them from the central ROB.
>> But I decided to have valued R.S. so it wasn't necessary.
>> Also, thinking about this in retrospect, that circular buffer would
>> have to be part of checkpoint-rollback so would add more complexity.
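FWIW, a minimal C model of what I had in mind (all names and sizes are mine, hypothetical), which also shows why rollback has to touch it:

```c
#include <stdint.h>
#include <assert.h>

#define CBUF_SLOTS 64               /* power of two, guessed capacity */

typedef struct {
    uint64_t val[CBUF_SLOTS];
    unsigned head;                  /* next slot Decode will allocate */
} ConstBuf;

/* Decode side: stash a constant, hand its index to the uOp
 * instead of the 64-bit value itself. */
unsigned cbuf_alloc(ConstBuf *cb, uint64_t c)
{
    unsigned idx = cb->head & (CBUF_SLOTS - 1);
    cb->val[idx] = c;
    cb->head++;
    return idx;
}

/* Issue side: the constant is used exactly once, so a read also
 * retires the slot (modeled here as a plain read). */
uint64_t cbuf_read(const ConstBuf *cb, unsigned idx)
{
    return cb->val[idx & (CBUF_SLOTS - 1)];
}

/* Checkpoint/rollback: restoring the saved head pointer discards
 * the constants of squashed younger instructions. This is the
 * extra checkpoint state the scheme drags in. */
void cbuf_rollback(ConstBuf *cb, unsigned saved_head)
{
    cb->head = saved_head;
}
```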
>
> There are lots of good microarchitectural choices in this less than optimally
> explored design space; and a few poor choices, too.
>
>> But valued R.S. solved that problem plus gave a place to hold forwarded
>> operand values, which eliminates having to re-read those values from
>> register file or ROB at issue time and the associated ports,
>> and those same R.S. registers were used to store the forwarding
>> match tag bits so it didn't really cost anything.
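As a toy C model of one valued-RS operand slot (names and widths invented; in hardware the tag bits would share the same register bits the value later occupies, shown here as separate fields for clarity):

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

/* One operand slot of a valued reservation station: while waiting
 * it holds the forwarding match tag; once the matching result is
 * broadcast it holds the captured value, so issue never re-reads
 * the register file or ROB for this operand. */
typedef struct {
    bool     ready;     /* value captured, tag no longer armed  */
    uint16_t tag;       /* result-bus tag to match while !ready */
    uint64_t value;     /* operand value once forwarded or read */
} RsOperand;

/* Snoop the result bus each cycle: on a tag match, capture. */
void rs_snoop(RsOperand *op, uint16_t bus_tag, uint64_t bus_val)
{
    if (!op->ready && op->tag == bus_tag) {
        op->value = bus_val;
        op->ready = true;
    }
}
```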
>
> Yes, yes.
>
> Can I ask:: Did you manage to put the stations in the pipeline without giving
> them a clock?
No, the RS are clocked, though I was still hoping to use latches
rather than FF. The critical path is forwarded values for
back-to-back instructions. Latches allow a forwarded value to
flow through from the result bus, through an RS, directly into the
calculation unit, then the RS latch closes holding the value.
The timing of this RS sample-hold action would be dependent on the
propagation delay on the result bus, I'm guessing 1/4 cycle delay
after the result is broadcast.
> In the Mc 88120, we could issue instructions into the reservation station.
> When the station was not already issuing an instruction, we allowed the issuing
> instruction to proceed directly into execution (assuming it had its operands).
> So, in effect, the RS did not have a stage in the pipe, but were simply feedback
> paths much like forwarding (but with storage.)
This sounds the same as what I describe above.
Since this back-to-back forwarding is the critical path
I also came up with a neat logic hack idea if the propagation
delay through the RS pathway was too slow.
(This really requires a white board but I'll give it a shot anyway.)
Below all takes place within 1 clock cycle.
It starts at the rising edge of the clock,
and the calculation result is saved in a FF at the next rising edge
for 1 cycle operations like ALU.
If the path along the result bus, through the routing muxes,
through the RS, through more routing muxes, into the Calculation Unit (CU)
was too long, there could be a "fast path" and "slow path"
for operand routing.
The problem is how to switch the CU operand input from the
fast path to slow path without glitching.
Ordinarily a mux selects either the left or right input,
and it can glitch the output when switching between them.
An AND-OR mux circuit, however, can route a forwarded operand directly
into the CU operand input so it can begin its calculation immediately.
The result would also route through the slow path, the RS latches,
the RS output bus, etc.
Later the RS latch closes to hold the forwarded value and puts it
onto the RS output bus leading to the CU.
Later the "fast path" mux enables the second path
from the RS latch while keeping the fast result enabled.
Since the values on the fast and slow paths are the same and
it goes through an AND-OR mux, it OR's two values that are the same
so there shouldn't be a glitch while it switches paths.
Later the fast path input is disabled and the
RS holds the input operand to the CU.
At the rising edge of the next clock the CU result is clocked into
a result FF, and possibly driven out onto the tri-state result bus
for storage in the register file or forwarding.
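A behavioral C model of that make-before-break sequence, just to show why the OR of equal values can't change the output (no analog timing here, obviously):

```c
#include <stdint.h>
#include <assert.h>

/* AND-OR mux: each leg is gated by its own enable and the legs
 * are OR'd, so both legs may be on simultaneously ("make before
 * break"), unlike a select mux that must be on one leg or the
 * other and can glitch while switching. */
uint64_t and_or_mux(uint64_t fast, int en_fast,
                    uint64_t slow, int en_slow)
{
    return (en_fast ? fast : 0) | (en_slow ? slow : 0);
}
```

Walking the sequence: fast path alone, then both paths with the same value (OR is unchanged), then slow path alone; the CU operand input never sees anything but the forwarded value.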