Collapsing buffers have major problems when instructions need to be replayed.
>
> Each LSQ entry is like a uOp except more complex as each has multiple
> states to sequence through with various Ready or Wait states, and multiple
> different schedulers selecting entries waiting for some operation.
<
In the case of Conditional Cache: the uOp can be fired to access lower in
the cache hierarchy, fired to send a now "in-order" LD to completion,
fired to send a ST with data to {D$ or lower in the hierarchy}, fired to
access the D$, and a bunch more.
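As a rough C sketch of what one of these entries might carry (all names
here are illustrative, not actual '120/K9 terms), it amounts to one state
to sequence through plus the ready flags the various schedulers select on:

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical CC/LSQ entry: one state to sequence through plus the
   ready flags that the various schedulers -- AGEN, translate, cache
   access, write-back -- select on.  Names are illustrative only. */
typedef enum {
    E_WAIT_AGEN,        /* waiting for an AGU to produce a VA            */
    E_WAIT_TRANSLATE,   /* VA_Ready set, waiting for the PA(s)           */
    E_WAIT_DISAMBIG,    /* PA_Ready set, waiting on older memrefs        */
    E_WAIT_CACHE,       /* PA_Allowed set, arbitrating for the D$ bus    */
    E_WAIT_DATA,        /* access sent, waiting for data / ACK           */
    E_DATA_VALID,       /* load data captured, can be written back       */
    E_COMPLETE          /* delivered, waiting for retire                 */
} entry_state;

typedef struct {
    entry_state state;
    bool     is_store;
    bool     va_ready, pa_ready, pa_allowed;
    uint64_t va;
    uint64_t pa[2];     /* second PA used only on a page straddle        */
    uint8_t  byte_mask; /* which bytes of the chunk this memref touches  */
    uint64_t data;      /* store data, or returned load data             */
} cc_entry;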
<
> The LSQ has multiple ports for adding new entries at head,
> plus ports selected by each different scheduling operation.
<
I add them issue-width at a time coordinated with branch checkpoints
to make backup easier.
>
> If an address calculation is required, it is routed through the
> AGU Reservation Stations, which has its own AGU-RS scheduler and
> writes its result Virtual Address into the associated LSQ entry
> and sets the entry's VA_Ready flag.
<
Note: A memref can AGEN multiple times under replay.
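A minimal C sketch of that write-back, reusing the cc_entry sketch above:
since a memref can re-AGEN under replay, anything derived from the old VA
is invalidated before VA_Ready is set again.

/* Hypothetical AGU result write-back into a CC/LSQ entry.  A memref can
   AGEN more than once under replay, so state derived from the old VA
   (translation, disambiguation) is invalidated first. */
void agen_writeback(cc_entry *e, uint64_t new_va)
{
    e->pa_ready   = false;          /* old translation no longer valid   */
    e->pa_allowed = false;          /* must re-run disambiguation        */
    e->va         = new_va;
    e->va_ready   = true;           /* translate scheduler may pick it   */
    e->state      = E_WAIT_TRANSLATE;
}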
>
> When the LSQ entry has a valid virtual address and its VA_Ready flag
> is set it requests translation, which the translate scheduler
> (a circular priority selector) sees and reads its VA through a LSQ port
> to the translate/table_walk machine. The entry waits until translate
> stores the 1 (or 2 if a straddle) physical addresses
> and sets the PA_Ready flag, or writes an error code.
> (If it is a translate error then the entry follows a different course.)
<
When an entry has a VA but not a PA it can neither complete nor access
lower in the hierarchy. Depending on the characteristics of the D$ it may
or may not access the D$.
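The "circular priority selector" doing the selecting here (and the
circular FF1s in the later schedulers) behaves like a find-first-one that
starts from a rotating pointer instead of from entry 0; a rough C sketch,
assuming the 48 request bits are packed into a 64-bit vector:

#include <stdint.h>

#define NENTRIES 48

/* Sketch of a circular priority selector: find the first requesting
   entry at or after 'start', wrapping around.  In hardware this is a
   rotate feeding a find-first-one; here it is just a loop.
   Returns -1 if no entry is requesting. */
int circular_ff1(uint64_t request, int start)
{
    for (int i = 0; i < NENTRIES; i++) {
        int idx = (start + i) % NENTRIES;
        if (request & (1ULL << idx))
            return idx;
    }
    return -1;
}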
>
> The table walker is one or more state machines that bypasses the LSQ,
> arbitrates with LSQ for the bus to talk to the cache. It can read and
> write PTE's from valid cache lines, and multiple walkers might have
> multiple cache misses outstanding at once. Because walkers bypass the
> LSQ and talk straight to cache they don't get stuck behind stalled
> LSQ entries. But it also requires an LSQ flush after writes to PTE's.
<
I always bypassed the L1 caches when tablewalking, typically accessing L2.
<
In Mc 88120 we did an experiment on multiple simultaneous tablewalks
and found that complexity wanting. 64-bit VA and PA may have changed
that situation.
>
> When an entry's PA_Ready flag is set, a scheduler (again a circular
> priority selector) sees it and selects the oldest entry,
<
Oldest entry that has no pending dependencies.
<
> which is passed to PA_Disambiguation to separate access by cache line,
> so that all loads and stores to the same line are performed in order,
<
Mc 88120 could perform 3 LDs and 3 STs to the same cache line
simultaneously, independent of the LD-ST orders. No data got lost
and no LD got data younger than it was supposed to see. This was
part of what MDM provided. So, while only 3 AGENs could be performed,
several other MemRefs could be fired out of CC while AGENing was
going on.
<
> but separate cache lines accesses can be performed in any order
> (that last statement is weaker than x86-TSO coherence but
> could be more restricted to all stores performed in order).
<
Mc 88120 was restricted by its 4-bank D$ and direct mapping using
VA address bits (16KB cache, 4KB pages); CC was not restricted.
Also note, Packet$ misses were satisfied through the D$ (making it
a U$).
>
> PA_Disambiguation is where your 48*48 dependency matrix comes into
> play, as it decides when an entry can actually talk to cache.
<
Access the D$ as quickly as possible. Decide if a LD can deliver its
result later based on MDM. {Remember LDs can fire from CC
as dependencies resolve.} When ST dependencies resolve,
STs can write to the D$ or lower levels of the hierarchy.
<
> When the dependency matrix indicates no older ancestors
> it sets the entry's PA_Allowed flag to enable memory access.
<
We never "set" anything, but when dependencies resolved MDM did assert
essentially the same signal as what you implied we set. The wording is
different because replay (K9) could reAGEN and we had to go back to
original dependencies. '120 backed up on this case (0 cycles) but K9
did not have instantaneous backup like '120.
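A toy C version of such a 48x48 matrix (the textbook shape of the idea,
not the MDM itself): each row holds the older memrefs an entry must still
wait on, resolving an entry clears its column, and a row of zeroes is the
"no older ancestors" condition.

#include <stdint.h>
#include <stdbool.h>

#define NENTRIES 48

/* Toy 48x48 dependency matrix.  dep[i] bit j means "entry i must wait
   for older entry j". */
typedef struct {
    uint64_t dep[NENTRIES];
} dep_matrix;

/* Record that 'younger' must wait for 'older' (same line, or an
   unresolved older address that might overlap). */
void add_dependence(dep_matrix *m, int younger, int older)
{
    m->dep[younger] |= 1ULL << older;
}

/* When an entry resolves, clear its column so every dependent row
   sees one fewer unresolved ancestor. */
void resolve_entry(dep_matrix *m, int entry)
{
    for (int i = 0; i < NENTRIES; i++)
        m->dep[i] &= ~(1ULL << entry);
}

/* "No older ancestors" => the entry may be granted cache access. */
bool may_access(const dep_matrix *m, int entry)
{
    return m->dep[entry] == 0;
}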
<
> This is where the Store Buffer and store-to-load forwarding occurs.
> (I'm not sure where multiple byte store merging should take place.)
<
As a LD accesses CC, each address-matching ST that has written data
asserts, and we use an age mask and find-firsts to sort out which data is
accumulated for the LD.
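Roughly, as a C sketch (entry layout and names are made up for
illustration): for each byte the LD wants, consider only STs that match
the line and cover that byte, and a find-first scanning backwards from
the LD in age order picks the youngest older ST; bytes with no match come
from the D$ line.

#include <stdint.h>
#include <stdbool.h>

#define NENTRIES 48

typedef struct {
    bool     valid, is_store, data_written;
    uint64_t line_addr;          /* cache-line address                   */
    uint8_t  byte_mask;          /* bytes of the (8-byte) chunk touched  */
    uint8_t  bytes[8];           /* store data, byte addressable         */
} fwd_entry;

/* Sketch of ST->LD forwarding: for each byte the LD wants, scan from the
   LD backwards in age order and take the youngest older ST that matches
   the line and covers the byte.  'order' lists entry indices from oldest
   to youngest; 'load_pos' is the LD's position in that order.
   Returns a mask of the bytes that were forwarded. */
uint8_t forward_bytes(const fwd_entry tbl[NENTRIES],
                      const int order[NENTRIES], int load_pos,
                      uint64_t line_addr, uint8_t want, uint8_t out[8])
{
    uint8_t got = 0;
    for (int b = 0; b < 8; b++) {
        if (!(want & (1u << b)))
            continue;
        for (int p = load_pos - 1; p >= 0; p--) {  /* youngest older first */
            const fwd_entry *st = &tbl[order[p]];
            if (st->valid && st->is_store && st->data_written &&
                st->line_addr == line_addr && (st->byte_mask & (1u << b))) {
                out[b] = st->bytes[b];
                got   |= 1u << b;
                break;
            }
        }
    }
    return got;                  /* bytes not in 'got' come from the D$  */
}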
>
> When the entry's PA_Allowed flag is set, a circular scheduler sees it
> and arbitrates for the LSQ-D$L1 cache access bus. On grant it
> sends its load PA-address or store PA-address and data to cache.
<
We read D$ (when no bank conflicts) and put a line into CC (128 bits) in
'120. In K9 we read sub-lines into a similar mechanism. So each CC
had a complete set of data (although it had byte masks to sort out
various difficulties). We just had to decide which entry was to supply
which piece of data. '120 copied forward after AGEN, K9 did not. Both
sets of logic were of similar complexity.
>
> (This is the part that is different from how Jouppi handles cache misses.)
> It also sends a one-hot bit mask with a bit set indicating which
> LSQ entry sourced the operation. This is the wake-up ID if needed.
<
'120 wakeup ID was the relaxed MDM on an entry and the entry being
in any undelivered but completed state; and of course, that checkpoint
having passed the "consistent point".
>
> D$L1 checks the cache and Miss Buffers for the PA.
> If it is a cache hit then it proceeds as normal read or write and ACK'd.
<
Misses were played out "from" the CC. Data was returned with its address,
and if anyone in the CC wanted the data, it would be snarfed in. This
enabled an external party to throw data at the CPU and if timing was
acceptable, the access would see 0 added latency. Bell Northern was
going to use this to feed instructions to the CPU before the CPU got
around to wanting them.
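A rough C sketch of that snarf, reusing the cc_entry sketch above and
assuming 64-byte lines: the fill is broadcast with its line address, and
every waiting entry that matches captures its word in the same cycle.

/* Sketch of address-matched snarfing: fill data comes back tagged with
   the byte address of its line, and every entry waiting on that line
   captures the word it wants.  Names from the cc_entry sketch above;
   64-byte lines of 8-byte words assumed. */
void snarf_fill(cc_entry tbl[], int n, uint64_t fill_line,
                const uint64_t fill_data[8])
{
    for (int i = 0; i < n; i++) {
        cc_entry *e = &tbl[i];
        if (e->state == E_WAIT_DATA && (e->pa[0] >> 6) == (fill_line >> 6)) {
            e->data  = fill_data[(e->pa[0] >> 3) & 7];  /* word select   */
            e->state = E_DATA_VALID;
        }
    }
}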
>
> If a cache miss then the Miss Buffer CAM is checked to see if
> already outstanding. If not a MB is allocated (or returns a
> stall signal if none available), and the wake-up mask is OR'd.
> In this example there were 48 LSQ entries so each MB has a 48 bit mask.
> The cache returns a Miss signal to LSQ which changes the state
> of the source LSQ entry from PA_Allowed to PA_Allowed_Wait,
> which removes it temporarily from the schedule.
<
'120 and K9 had no miss buffer (other than CC).
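For the quoted MB scheme itself (not '120/K9), the bookkeeping might look
like this sketch: CAM on the line address, allocate on a new miss or
signal a stall, and OR the requester's one-hot LSQ-entry bit into the
wake-up mask either way.

#include <stdint.h>

#define NMB 8                       /* illustrative number of MBs        */

typedef struct {
    int      valid;
    uint64_t line_addr;
    uint64_t wake_mask;             /* one bit per LSQ entry (48 used)   */
} miss_buffer;

typedef enum { MB_HIT, MB_ALLOCATED, MB_STALL } mb_result;

/* Sketch of the quoted miss-buffer scheme: CAM on the line address,
   allocate if new (or stall if full), and OR in the requester's one-hot
   LSQ-entry bit so the eventual fill can wake everyone at once. */
mb_result mb_lookup_or_alloc(miss_buffer mb[NMB], uint64_t line_addr,
                             int lsq_entry)
{
    for (int i = 0; i < NMB; i++)               /* CAM match             */
        if (mb[i].valid && mb[i].line_addr == line_addr) {
            mb[i].wake_mask |= 1ULL << lsq_entry;
            return MB_HIT;
        }
    for (int i = 0; i < NMB; i++)               /* allocate a new MB     */
        if (!mb[i].valid) {
            mb[i].valid     = 1;
            mb[i].line_addr = line_addr;
            mb[i].wake_mask = 1ULL << lsq_entry;
            return MB_ALLOCATED;
        }
    return MB_STALL;                            /* none free             */
}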
>
> When the missed line returns from $L2 it checks the MB CAM and hits.
> It merges the new line with any stashed store bytes,
> and moves the merged line to the D$ banks.
<
Don't write the D$ while data is in CC; wait until the instruction retires...
>
> The wake-up mask is sent from D$L1 to LSQ over the 48-bit wake-up bus,
> which changes the state of ALL those LSQ entries back to PA_Allowed
> in 1 clock (this eliminates any sequential sequencing from wake-up).
<
'120 and K9 had the MDM attached to the CC in such a way that this bus
was no more than a set of wires no longer than about 6 gates of length.
{You can't really call it a bus at that point--just 48 short wires.}
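Either way, functionally the wake-up is no more than this sketch (a loop
in C, 48 parallel wires in the hardware; names from the cc_entry sketch
above): every entry named in the mask goes back to the allowed state in
the same cycle, so there is no sequential walking of waiters.

/* Sketch of the wake-up broadcast: one bit per entry, and every entry
   named in the mask goes from "waiting on the missed line" back to
   "allowed to (re)access the D$" in the same clock. */
void wakeup_broadcast(cc_entry tbl[], int n, uint64_t wake_mask)
{
    for (int i = 0; i < n; i++)
        if (wake_mask & (1ULL << i)) {
            tbl[i].pa_allowed = true;
            tbl[i].state      = E_WAIT_CACHE;   /* replay the access     */
        }
}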
>
> The LSQ-D$_cache scheduler sees those entries again and replays it,
> this time gets a load hit, D$L1 returns the data to the LSQ entry,
> which stores the data in the entry and changes its state to Data_Valid.
> The write-back scheduler selects a Data_Valid entry
> and arbitrates for a data write-back/forwarding bus.
>
> When load data has been sent, the LSQ entry is marked Complete.
> Later an LSQ entry Retire unit moves up the circular buffer tail
> freeing the LSQ entry for reuse.
>
> Whew...
<
Whew, indeed. The CC was one of my more interesting "adventures" in
computer architecture. There was the VA CAM, the PA CAM, the DATA
store, the MDM, and several schedulers (circular FF1s). The way it was
designed resulted in a circuit that was only 6 wires wide !!!