Partially valid cache lines (per line byte mask)?

Marcus
Jun 2, 2022, 6:43:03 AM
About data caches (I have never made one, so trying to learn)...

Would this "partially valid" data cache paradigm be useful? Is it common
practice?

When an instruction writes a partial cache line (e.g. a single word, or
perhaps a single byte), and the cache line does not exist in the cache
(i.e. there's a write miss), one could potentially write the partial
cache line to the cache without first fetching the full cache line from
memory.

Rather than having a single "valid" flag for the cache line, one would
need a valid flag for each byte in the cache line (i.e. a byte mask).

If the cache line is evicted it would be flushed to memory along with
the byte mask (here I'm assuming that every level in the memory
hierarchy has support for byte-masked writes).

Checking for a cache read hit would require comparing the byte mask of
the read operation with the byte mask of the cache line.

If there's a "partial" cache read miss one would have to fetch the full
cache line from memory and apply the dirty bytes (or use the inverted
cache line byte mask when writing the cache line from memory to the data
cache).
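The mechanics described above can be sketched in software (a toy model, not RTL; all names here are invented for illustration): a line carries a per-byte valid mask, a write miss allocates without fetching, a read hits only if all of its bytes are valid, and a fill merges memory data into the not-yet-valid bytes (the "inverted mask" write).

```python
LINE_SIZE = 64  # bytes per cache line (assumed)

class CacheLine:
    """One cache line with a per-byte valid mask."""
    def __init__(self):
        self.data = bytearray(LINE_SIZE)
        self.valid = [False] * LINE_SIZE  # per-byte valid flags

    def write(self, offset, value_bytes):
        """Partial write on a miss: no fetch needed, just mark bytes valid."""
        for i, b in enumerate(value_bytes):
            self.data[offset + i] = b
            self.valid[offset + i] = True

    def read_hits(self, offset, size):
        """Read hit only if every requested byte is valid."""
        return all(self.valid[offset:offset + size])

    def fill_from_memory(self, mem_line):
        """On a partial read miss, fetch the full line and merge:
        memory bytes land only where the mask is still invalid,
        so dirty bytes survive the fill."""
        for i in range(LINE_SIZE):
            if not self.valid[i]:
                self.data[i] = mem_line[i]
                self.valid[i] = True

line = CacheLine()
line.write(4, b"\xab\xcd")         # write miss: allocate without fetch
assert line.read_hits(4, 2)        # those two bytes now hit
assert not line.read_hits(0, 8)    # a wider read is a partial miss
line.fill_from_memory(bytes(range(64)))
assert line.read_hits(0, 64)       # fully valid after the merge
assert line.data[4] == 0xab        # dirty byte survived the fill
```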

/Marcus

Niklas Holsti
Jun 2, 2022, 7:31:21 AM
On 2022-06-02 13:42, Marcus wrote:
> About data caches (I have never made one, so trying to learn)...
>
> Would this "partially valid" data cache paradigm be useful? Is it common
> practice?


I believe that the LEON2 processor (a SPARC processor used in space
applications) has a data cache that has a "valid" bit per 32-bit word
(but not per 8-bit byte). The reason is probably that the memory
interface is narrow -- 32 bits -- so loading an entire cache line, on
cache miss, would take many memory cycles. Loading just the missing word
(or doubleword) takes less time.

The proposed Mill processor architecture has per-byte flags:
https://millcomputing.com/docs/#memory.


> When an instruction writes a partial cache line (e.g. a single word, or
> perhaps a single byte), and the cache line does not exist in the cache
> (i.e. there's a write miss), one could potentially write the partial
> cache line to the cache without first fetching the full cache line from
> memory.


The LEON2 data cache is write-through, without (I believe) allocating a
line on write-miss, so that question does not arise.

The Mill works as you suggest; in fact, write-misses cannot happen.

Michael S
Jun 2, 2022, 8:25:36 AM
On Thursday, June 2, 2022 at 1:43:03 PM UTC+3, Marcus wrote:
> About data caches (I have never made one, so trying to learn)...
>
> Would this "partially valid" data cache paradigm be useful?

Only if you are sure that you never ever will want SMP and if you are sure
that you are willing to manage by hand coherence of memory that is shared with
various DMAs.

> Is it common practice?
>

Depends on which part of the computing world.
In general-purpose cores, from the smallest Cortex-A and up - no.
In deep embedded you can probably find it, esp. in old stuff.
Newer deep embedded tends to be either totally cacheless on the data side,
like Arm Cortex M0/3/4, or to use the same coherence principles as the "big guys",
which require a cache line to be either here or there, never both.

> When an instruction writes a partial cache line (e.g. a single word, or
> perhaps a single byte), and the cache line does not exist in the cache
> (i.e. there's a write miss), one could potentially write the partial
> cache line to the cache without first fetching the full cache line from
> memory.
>
> Rather than having a single "valid" flag for the cache line, one would
> need a valid flag for each byte in the cache line (i.e. a byte mask).
>
> If the cache line is evicted it would be flushed to memory along with
> the byte mask (here I'm assuming that every level in the memory
> hierarchy has support for byte-masked writes).

An ECC memory controller can provide byte enables on the host-facing interface,
but on the memory-facing side it will have to do a read-modify-write.
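A toy model of that point, with a parity byte standing in for real SECDED ECC (all names invented for the sketch): because the check bits cover a whole word, a byte-masked write behind ECC must read the word, merge the enabled bytes, recompute the check bits, and write back.

```python
ECC_WORD = 8  # bytes covered by one set of check bits (assumed)

def ecc(word: bytes) -> int:
    # stand-in for a real SECDED code: a simple XOR parity byte
    p = 0
    for b in word:
        p ^= b
    return p

# memory maps word address -> (data, check bits)
memory = {0: (bytearray(b"ABCDEFGH"), ecc(b"ABCDEFGH"))}

def masked_write(addr, new_bytes, mask):
    """Apply a byte-masked write to an ECC-protected word."""
    word, _ = memory[addr]            # 1. read the whole word (the "R" in RMW)
    for i in range(ECC_WORD):         # 2. merge only the enabled bytes
        if mask[i]:
            word[i] = new_bytes[i]
    memory[addr] = (word, ecc(word))  # 3. recompute ECC, write back

masked_write(0, b"xy??????", [True, True] + [False] * 6)
word, check = memory[0]
assert bytes(word) == b"xyCDEFGH"
assert check == ecc(b"xyCDEFGH")
```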

Marcus
Jun 2, 2022, 9:27:44 AM
On 2022-06-02, Niklas Holsti wrote:
> On 2022-06-02 13:42, Marcus wrote:
>> About data caches (I have never made one, so trying to learn)...
>>
>> Would this "partially valid" data cache paradigm be useful? Is it common
>> practice?
>
>
> I believe that the LEON2 processor (a SPARC processor used in space
> applications) has a data cache that has a "valid" bit per 32-bit word
> (but not per 8-bit byte). The reason is probably that the memory
> interface is narrow -- 32 bits -- so loading an entire cache line, on
> cache miss, would take many memory cycles. Loading just the missing word
> (or doubleword) takes less time.
>
> The proposed Mill processor architecture has per-byte flags:
> https://millcomputing.com/docs/#memory.

Thanks for the reference! The juicy bits are at 51:30 in the talk:

https://youtu.be/bjRDaaGlER8?t=3090

Marcus
Jun 2, 2022, 9:52:58 AM
On 2022-06-02, Michael S wrote:
> On Thursday, June 2, 2022 at 1:43:03 PM UTC+3, Marcus wrote:
>> About data caches (I have never made one, so trying to learn)...
>>
>> Would this "partially valid" data cache paradigm be useful?
>
> Only if you are sure that you never ever will want SMP and if you are sure
> that you are willing to manage by hand coherence of memory that is shared with
> various DMAs.

Could you please elaborate why this is the case? For instance, write-back
caches also suffer from DMA coherence issues, right?

>
>> Is it common practice?
>>
>
> Depends, in which part of computing world.
> In general purpose cores, from the smallest Cortex-A and up - no.
> In deep embedded, you probably can find it, esp. in old stuff.
> Newer deep embedded tends to be either totally cacheless on the data side,
> like Arm Cortex M0/3/4, or to use the same coherence principles as "big guys"
> which require a cache line to be either here or there, never both.
>
>> When an instruction writes a partial cache line (e.g. a single word, or
>> perhaps a single byte), and the cache line does not exist in the cache
>> (i.e. there's a write miss), one could potentially write the partial
>> cache line to the cache without first fetching the full cache line from
>> memory.
>>
>> Rather than having a single "valid" flag for the cache line, one would
>> need a valid flag for each byte in the cache line (i.e. a byte mask).
>>
>> If the cache line is evicted it would be flushed to memory along with
>> the byte mask (here I'm assuming that every level in the memory
>> hierarchy has support for byte-masked writes).
>
> ECC memory controller can provide byte enables on the host-facing interface,
> but on the memory-facing side it will have to do read-modify-write.
>

I'm not familiar with all popular RAM types, but for SDRAM modules
it appears to be common to have a byte mask ("data mask", "DQM" etc).

Ivan Godard
Jun 2, 2022, 10:13:17 AM
On 6/2/2022 6:52 AM, Marcus wrote:
> On 2022-06-02, Michael S wrote:
>> On Thursday, June 2, 2022 at 1:43:03 PM UTC+3, Marcus wrote:
>>> About data caches (I have never made one, so trying to learn)...
>>>
>>> Would this "partially valid" data cache paradigm be useful?
>>
>> Only if you are sure that you never ever will want SMP and  if you are
>> sure
>> that you are willing to manage by hand coherence of memory that is
>> shared with
>> various DMAs.
>
> Could you please elaborate why this is the case? For instance, write-back
> caches also suffer from DMA coherence issues, right?

Yes. Both Mill and Mitch's My66 treat the problem by considering a DMA
that must be coherent as just another core - the DMA transfers are
cached normally at the innermost writeback level. Mill (don't know about
My66) also has incoherent DMA, where a region of the address space can
be handed to an active agent (core or DMA) which thereafter owns it and
does not participate in coherence. This models a supercomputer-style
message passing interface.

Ivan Godard
Jun 2, 2022, 10:15:24 AM
On 6/2/2022 6:27 AM, Marcus wrote:
> On 2022-06-02, Niklas Holsti wrote:
>> On 2022-06-02 13:42, Marcus wrote:
>>> About data caches (I have never made one, so trying to learn)...
>>>
>>> Would this "partially valid" data cache paradigm be useful? Is it common
>>> practice?
>>
>>
>> I believe that the LEON2 processor (a SPARC processor used in space
>> applications) has a data cache that has a "valid" bit per 32-bit word
>> (but not per 8-bit byte). The reason is probably that the memory
>> interface is narrow -- 32 bits -- so loading an entire cache line, on
>> cache miss, would take many memory cycles. Loading just the missing
>> word (or doubleword) takes less time.
>>
>> The proposed Mill processor architecture has per-byte flags:
>> https://millcomputing.com/docs/#memory.
>
> Thanks for the reference! The juicy bits are at 51:30 in the talk:
>
> https://youtu.be/bjRDaaGlER8?t=3090
>

We have patents in this area, but would be happy to give you a free
license if you want to experiment.

Ivan Godard
Jun 2, 2022, 10:19:48 AM
On 6/2/2022 5:25 AM, Michael S wrote:
> On Thursday, June 2, 2022 at 1:43:03 PM UTC+3, Marcus wrote:
>> About data caches (I have never made one, so trying to learn)...
>>
>> Would this "partially valid" data cache paradigm be useful?
>
> Only if you are sure that you never ever will want SMP and if you are sure
> that you are willing to manage by hand coherence of memory that is shared with
> various DMAs.

SMP works fine. DMA has the same problems that it always has.

>> Is it common practice?
>>
>
> Depends, in which part of computing world.
> In general purpose cores, from the smallest Cortex-A and up - no.
> In deep embedded, you probably can find it, esp. in old stuff.
> Newer deep embedded tends to be either totally cacheless on the data side,
> like Arm Cortex M0/3/4, or to use the same coherence principles as "big guys"
> which require a cache line to be either here or there, never both.
>

Yes, that's the problem with legacy coherence protocols: ping-ponging
and false sharing. But when you think about it, what you really want is
for a *byte* to be either here or there, and *lines* are implementation
artifacts.

Marcus
Jun 2, 2022, 11:00:23 AM
Thanks (great talk BTW)! What parts are patented?

The things I'm thinking about feel like "trivial" stuff. E.g. having
per-byte valid flags can roughly be thought of as having a cache line
that is one byte in size, except that you work with groups of "lines" in
certain situations (e.g. for the cache tag) for increased efficiency.
Likewise, propagating the byte mask via the cache line instead of
getting rid of it when committing a write buffer/queue item to the cache
(write-back) or passing it along all the way to RAM (write-through)
doesn't feel like a far-fetched idea, IMO.

But I guess that the devil is in the details (and I haven't gotten that
far yet).

/Marcus

Marcus
Jun 2, 2022, 11:08:46 AM
On 2022-06-02, Marcus wrote:
> On 2022-06-02, Ivan Godard wrote:
>> On 6/2/2022 6:27 AM, Marcus wrote:
>>> On 2022-06-02, Niklas Holsti wrote:
>>>> On 2022-06-02 13:42, Marcus wrote:
>>>>> About data caches (I have never made one, so trying to learn)...
>>>>>
>>>>> Would this "partially valid" data cache paradigm be useful? Is it
>>>>> common
>>>>> practice?
>>>>
>>>>
>>>> I believe that the LEON2 processor (a SPARC processor used in space
>>>> applications) has a data cache that has a "valid" bit per 32-bit
>>>> word (but not per 8-bit byte). The reason is probably that the
>>>> memory interface is narrow -- 32 bits -- so loading an entire cache
>>>> line, on cache miss, would take many memory cycles. Loading just the
>>>> missing word (or doubleword) takes less time.
>>>>
>>>> The proposed Mill processor architecture has per-byte flags:
>>>> https://millcomputing.com/docs/#memory.
>>>
>>> Thanks for the reference! The juicy bits are at 51:30 in the talk:
>>>
>>> https://youtu.be/bjRDaaGlER8?t=3090
>>>
>>
>> We have patents in this area, but would be happy to give you a free
>> license if you want to experiment.
>
> Thanks (great talk BTW)! What parts are patented?

Ok, I think I found the patent you were referring to (US 9,513,904 B2,
"Computer processor employing cache memory with per-byte valid bits"):

https://millcomputing.com/blog/wp-content/uploads/2016/10/uspto9513904.pdf

Will give it a read.

MitchAlsup
Jun 2, 2022, 12:40:04 PM
On Thursday, June 2, 2022 at 5:43:03 AM UTC-5, Marcus wrote:
> About data caches (I have never made one, so trying to learn)...
>
> Would this "partially valid" data cache paradigm be useful? Is it common
> practice?
<
No, in general, many designers think the 12.5% size overhead is excessive.
>
> When an instruction writes a partial cache line (e.g. a single word, or
> perhaps a single byte), and the cache line does not exist in the cache
> (i.e. there's a write miss), one could potentially write the partial
> cache line to the cache without first fetching the full cache line from
> memory.
<
What about cache coherence ? What if 2 processors write the same byte
at the same time ? The real question is what semantic do you ascribe to
memory:: does a memory container always contain the last value written ?
And how do you define last ?
<
Basically, in a coherent system, you need write permission before performing
the write--and obtaining permission is at least as expensive as just getting
the whole line.
<
The only time My 66000 allows one to write prior to the line showing up
is when it can be determined that the entire line (every byte) is written
and thereby every copy around the system is to be invalid. This happens
in Vectorized loops and in MM (Memory to Memory Move.)
<
Also Note: When the size of MM is less than or equal to a page, the
entire data movement is atomic as seen everywhere else in the system.
>
> Rather than having a single "valid" flag for the cache line, one would
> need a valid flag for each byte in the cache line (i.e. a byte mask).
>
> If the cache line is evicted it would be flushed to memory along with
> the byte mask (here I'm assuming that every level in the memory
> hierarchy has support for byte-masked writes).
<
What about cache-to-cache transfers ? What if several processors have written
several bytes to the cache line ? How do you remember which one went first ?
How do you reassemble the line of data ?
>
> Checking for a cache read hit would require comparing the byte mask of
> the read operation with the byte mask of the cache line.
>
How much data do you read if you need 4 bytes and have only 1 ?
>
> If there's a "partial" cache read miss one would have to fetch the full
> cache line from memory and apply the dirty bytes (or use the inverted
> cache line byte mask when writing the cache line from memory to the data
> cache).
<
What if DRAM supplies 1 byte, processor 2 supplies 2 bytes, and processor
3 supplies 1 byte ?
<
At this point, it is unlikely that you have saved any time or bought any
performance.
>
> /Marcus

MitchAlsup
Jun 2, 2022, 12:42:32 PM
On Thursday, June 2, 2022 at 8:52:58 AM UTC-5, Marcus wrote:
> On 2022-06-02, Michael S wrote:
> > On Thursday, June 2, 2022 at 1:43:03 PM UTC+3, Marcus wrote:
> >> About data caches (I have never made one, so trying to learn)...
> >>
> >> Would this "partially valid" data cache paradigm be useful?
> >
> > Only if you are sure that you never ever will want SMP and if you are sure
> > that you are willing to manage by hand coherence of memory that is shared with
> > various DMAs.
> Could you please elaborate why this is the case? For instance, write-back
> caches also suffer from DMA coherence issues, right?
<
DMA is conceptually no different than a processor.
However, DMA is seldom "as coherent" as a processor.
> >
> >> Is it common practice?
> >>
> >
> > Depends, in which part of computing world.
> > In general purpose cores, from the smallest Cortex-A and up - no.
> > In deep embedded, you probably can find it, esp. in old stuff.
> > Newer deep embedded tends to be either totally cacheless on the data side,
> > like Arm Cortex M0/3/4, or to use the same coherence principles as "big guys"
> > which require a cache line to be either here or there, never both.
> >
> >> When an instruction writes a partial cache line (e.g. a single word, or
> >> perhaps a single byte), and the cache line does not exist in the cache
> >> (i.e. there's a write miss), one could potentially write the partial
> >> cache line to the cache without first fetching the full cache line from
> >> memory.
> >>
> >> Rather than having a single "valid" flag for the cache line, one would
> >> need a valid flag for each byte in the cache line (i.e. a byte mask).
> >>
> >> If the cache line is evicted it would be flushed to memory along with
> >> the byte mask (here I'm assuming that every level in the memory
> >> hierarchy has support for byte-masked writes).
> >
> > ECC memory controller can provide byte enables on the host-facing interface,
> > but on the memory-facing side it will have to do read-modify-write.
> >
> I'm not familiar with all popular RAM types, but for SDRAM modules
> it appears to be common to have a byte mask ("data mask", "DQM" etc).
<
Memory manufacturers add byte masks to allow for special cases;
CPUs, especially higher-end CPUs, seldom use those features.

MitchAlsup
Jun 2, 2022, 12:46:34 PM
The coherence (or lack thereof) is determined by the mapping tables
(just like CPU accesses). It is expected that DMA uses the virtual address
space of the "originating request" and that the I/O driver simply informs
the HostBridge of the mapping table address. Thus, the I/O driver simply passes
the virtual address of the originating request to the DMA device.

EricP
Jun 2, 2022, 2:39:10 PM
MitchAlsup wrote:
> On Thursday, June 2, 2022 at 5:43:03 AM UTC-5, Marcus wrote:
>> About data caches (I have never made one, so trying to learn)...
>>
>> Would this "partially valid" data cache paradigm be useful? Is it common
>> practice?
> <
> No, in general, many designers think the 12.5% size overhead is excessive.

That's for each node's local cache.
In the home directory, which tracks which node owns each line,
it would now have to track which node owned each byte.
Say 64 nodes = 6-bit node ID, 64-byte line,
= 384 extra status bits per line.
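That arithmetic, spelled out as a sketch under the same assumptions (64 nodes, 64-byte lines; a directory holding one owner ID per tracked unit):

```python
nodes = 64
node_id_bits = (nodes - 1).bit_length()  # 6 bits to name one of 64 nodes
line_bytes = 64

per_line = node_id_bits               # per-line ownership: one owner ID
per_byte = node_id_bits * line_bytes  # per-byte ownership: one ID per byte

assert node_id_bits == 6
assert per_byte == 384                # the "384 extra status bits per line"
```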


MitchAlsup
Jun 2, 2022, 2:57:41 PM
More if the architecture has an unbounded number of nodes per system.
<
Then, is a "node" a CPU core or a chip containing several handfuls of cores ?

MitchAlsup
Jun 2, 2022, 2:59:15 PM
To OP: Are you going to keep the normal 3 cache-line status bits on a per-byte
basis ? {inValid, Readable, Exclusive, Shared, Modified}

MitchAlsup
Jun 2, 2022, 7:03:10 PM
It seems to me that you get essentially everything you think you are trying
to get by having a buffer where several cache lines can have multiple partial
writes already performed WHILE the miss buffer is sending out cache miss
requests and awaiting return of data. Your pending writes are performed as
the miss buffer data is written into the actual cache.
<
And this way, you screw up neither the pipeline nor the cache coherence
semantics.

Ivan Godard
Jun 2, 2022, 11:05:16 PM
On 6/2/2022 9:40 AM, MitchAlsup wrote:
> On Thursday, June 2, 2022 at 5:43:03 AM UTC-5, Marcus wrote:
>> About data caches (I have never made one, so trying to learn)...
>>
>> Would this "partially valid" data cache paradigm be useful? Is it common
>> practice?
> <
> No, in general, many designers think the 12.5% size overhead is excessive.
>>
>> When an instruction writes a partial cache line (e.g. a single word, or
>> perhaps a single byte), and the cache line does not exist in the cache
>> (i.e. there's a write miss), one could potentially write the partial
>> cache line to the cache without first fetching the full cache line from
>> memory.
> <
> What about cache coherence ? What if 2 processors write the same byte
> at the same time ? The real question is what semantic do you ascribe to
> memory:: does a memory container always contain the last value written ?
> And how do you define last ?
> <
> Basically, in a coherent system, you need write permission before performing
> the write--and obtaining permission is at least as expensive as just getting
> the whole line.

Actually not true, although widely believed. Write ownership is just an
implementation of a consistency model
(https://en.wikipedia.org/wiki/Consistency_model). One-at-a-time does
impose an ordering - but there are other ways to impose orderings, and
it's the order that matters.

Mill coherence is NYF, but it only works when you *don't* hold bytes you
haven't written for ransom.

Ivan Godard
Jun 2, 2022, 11:05:45 PM
Nope :-)

Ivan Godard
Jun 2, 2022, 11:31:00 PM
Again, this begs the issue - it assumes a particular implementation. I
grant that implementation would be wasteful when using valid bits -
which is why we don't do it that way. But it's not the only way to
provide a consistent model.

EricP
Jun 3, 2022, 1:19:03 AM
Upon further consideration, I disagree with myself too. :-)


Marcus
Jun 3, 2022, 2:16:37 AM
I don't see why that would be necessary. "Modified" would be per cache
line. "Valid" would be per byte.

My current use case is a single in-order core, so coherency is less of
an issue. Stalls due to memory accesses are more of an issue. As I said,
I have never made a D$ before, so I'm currently toying with different
ideas.

I saw two common classes of problems that I wanted to solve related to
potential write misses that are "false" (in the sense that whatever was
stored in RAM before the write operations is irrelevant since it will
never be read):

- Stack variables / spill.
- Writing to an array byte-by-byte.

Having a sufficiently deep write queue (to hide write misses) with
sufficiently advanced read-back logic would solve the problem.

Then it occurred to me that storing partial cache lines with byte-enable
would essentially solve the same problem, without having to have an
advanced write queue. Remember, it's in-order and I don't do any
speculative execution (except for branch prediction, but that's only
speculative up-to-and-including register fetch), so there are no
speculative writes at this point.

/Marcus

MitchAlsup
Jun 3, 2022, 11:58:43 AM
Any DMA remains a problem. What do you do if 3 separate bytes are
in the processor cache and 13 are in DRAM when a DMA line read needs
to be performed ?
>
> I saw two common classes of problems that I wanted to solve related to
> potential write misses that are "false" (in the sense that whatever was
> stored in RAM before the write operations is irrelevant since it will
> never be read):
>
> - Stack variables / spill.
<
My 66000 has a special dispensation when using safe-stack: writes
to the safe-stack do not need to fetch memory data (it is known to have
just gone stale), and evicted lines are known not to need writing back to
memory (when the S-SP has a greater value than the line being evicted).
<
> - Writing to an array byte-by-byte.
<
My 66000 vectorization model allows byte-by-byte stores to be observed
larger than a single instruction. When it can be determined that an entire
line (or more) will be written, the line is not fetched, but simply invalidated
in all other caches.
>
> Having a sufficiently deep write queue (to hide write misses) with
> sufficiently advanced read-back logic would solve the problem.
>
> Then it occurred to me that storing partial cache lines with byte-enable
> would essentially solve the same problem, without having to have an
> advanced write queue. Remember, it's in-order and I don't do any
<
Remember all DMA ruins your interpretation of in-order.
<
> speculative execution (except for branch prediction, but that's only
> speculative up-to-and-including register fetch), so there are no
> speculative writes at this point.
<
There is still DMA.
>
> /Marcus

EricP
Jun 3, 2022, 12:21:19 PM
This is partially handled by what Mitch referred to earlier as
Miss Buffers and others refer to as Miss Status Holding Registers.
It not only allows non-blocking hit-under-miss loads,
it also gives the Load Store Queue a place to stash write bytes
when there is a write miss, and can source bytes if later read back
before the write-miss completes (e.g. a quick push-pop).

Complexity-Performance Tradeoffs with Non-blocking Loads 1994
http://courses.cs.washington.edu/courses/csep548/00sp/handouts/lockup_free.pdf
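The MSHR behaviour described above can be sketched as a toy model (names invented; a real MSHR is hardware, and this ignores address matching across multiple MSHRs): a write miss stashes its bytes with a mask, a later load can be sourced from those bytes before the fill returns, and the returning line is merged with the pending write bytes winning.

```python
LINE = 64  # bytes per cache line (assumed)

class MSHR:
    """Toy Miss Status Holding Register for one outstanding line miss."""
    def __init__(self, line_addr):
        self.line_addr = line_addr
        self.data = bytearray(LINE)
        self.mask = [False] * LINE   # which bytes pending writes own

    def stash_write(self, offset, value):
        """A write miss drops its bytes here instead of stalling."""
        for i, b in enumerate(value):
            self.data[offset + i] = b
            self.mask[offset + i] = True

    def try_forward(self, offset, size):
        """Source a load from pending write data, if fully covered."""
        if all(self.mask[offset:offset + size]):
            return bytes(self.data[offset:offset + size])
        return None  # must wait for the fill

    def complete_fill(self, mem_line):
        """Merge the returned line: pending write bytes win."""
        return bytes(self.data[i] if self.mask[i] else mem_line[i]
                     for i in range(LINE))

m = MSHR(0x1000)
m.stash_write(8, b"\x11\x22")              # quick push...
assert m.try_forward(8, 2) == b"\x11\x22"  # ...then pop, before fill returns
assert m.try_forward(0, 4) is None         # not covered: wait for the fill
filled = m.complete_fill(bytes(LINE))
assert filled[8] == 0x11 and filled[0] == 0
```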

There are also various design options for cache resource management:

Cache Write Policies and Performance 1993
https://home.gwu.edu/~mlancast/CS6461Reference/Cache/12_write.pd.pdf

A search for "non blocking cache" gets lots of hits.
It also matters whether this is for ASIC or FPGA, as the latter does
not have CAMs available and so needs different designs.

This makes the loads and stores a kind of Async IO.

It doesn't eliminate the need for a coherence protocol.
If you have multiple writers to the same byte then you need
to establish a global order to decide which write came first
and which came second because only the later value is retained.


MitchAlsup
Jun 3, 2022, 2:04:04 PM
Modern semiconductor technologies do not have CAMs either--we
build them out of std gates.
>
> This makes the loads and stores a kind of Async IO.
>
> It doesn't eliminate the need for a coherence protocol.
> If you have multiple writers to the same byte then you need
> to establish a global order to decide which write came first
> and which came second because only the later value is retained.
<
Back in the Mc 88120 days, we used a buffer called the Conditional
cache, a temporally oriented cache and we built direct HW to handle
several important special cases.
<
Consider 2 ST instructions separated by a LD instruction where the
AGEN of the LD is dependent on a long running calculation. So,
both ST instructions are performed and drop their data into the CC.
Later the LD AGENs, and the conditional cache uses a time mask
to get data from the early store and ignore data from the later store
and the LD is completely satisfied from the CC and needs no access
to the D$.
<
Here, you don't even need to AGEN in program order and you still get
the proper data delivered to the LD.
<
The temporal organization we used was that each checkpoint had 3
places to "queue" memory reference instructions (because each issue
could issue a maximum of 3 memory refs). {You could also call this
a L0 cache}. Since the execution window was 16 checkpoints deep,
the CC had 48 lines (128-bits each) and so there was a entry for each
possible MR in the EW.
<
There was a 48-bit×48-bit memory dependency matrix which was used
to track dependencies. On issue, every LD was made dependent on every
prior ST. As LDs AGENed, any address mismatched relaxed the MDM
and when the MR was dependency free, it could be preformed. LDs could
fire out of the CC as if it were a reservation station, STs could fire out to
update the D$ after the consistent point when it had data. Both LDs and STs
which missed the CC would access the D$ and fill the CC upon hit. Misses
in the CC would be fired out to DRAM to fill.
<
CC provided a place for DRAM data, and ST data to "pend" while awaiting
other dependencies to relax. We went so far as to track collisions in D$ so
if a younger MR would displace an older MR in D$, the older MR was written
directly to DRAM allowing younger access to write to D$. {This got rid of
about ½ of the write traffic to the direct mapped D$}
<
While "in" CC any LD could hit and deliver data--even performing gather
when several writes were accumulated into a single result for a LD.
<
This was 1992...........

Ivan Godard
Jun 3, 2022, 10:32:32 PM
On 6/3/2022 9:19 AM, EricP wrote:<skip>


> This makes the loads and stores a kind of Async IO.
>
> It doesn't eliminate the need for a coherence protocol.
> If you have multiple writers to the same byte then you need
> to establish a global order to decide which write came first
> and which came second because only the later value is retained.

"global order" is a technical term also known as strict consistency. It
is only a mathematical notion because it is not physically realizable.
"Sequential consistency" is the strongest consistency model that is
physically possible; that's what Mill provides. No common legacy chip is
sequentially consistent, although x86 is close.

The problem is fundamental, a property of Einsteinian mechanics; a
universe in which strict consistency were possible could also have
wormholes and time machines. It's been speculated (sorry, no cite) that
global order is possible in the vicinity of a black hole event horizon,
but the math is beyond me.

Wikipedia has a good introduction to consistency models in general and
the various kinds. The subject is subtle and counter-intuitive.

Ivan Godard
Jun 3, 2022, 10:45:54 PM
And an elegant piece of work it was.

It also sounds remarkably like what the instruction scheduler does in
the Mill code gen a.k.a. specializer, except mostly at compile time; all
the dependency chaining in particular. Store snooping is the part the
hardware does at run time, and we use the victim buffers of the cache as
what you are calling a L0.

Anton Ertl
Jun 4, 2022, 4:20:39 AM
Ivan Godard <iv...@millcomputing.com> writes:
>"global order" is a technical term also known as strict consistency. It
>is only a mathematical notion because it is not physically realizable.

Let's see:

<https://en.wikipedia.org/wiki/Consistency_model#Strict_consistency> says:

|Under this model, a write to a variable by any processor needs to be
|seen instantaneously by all processors.

I don't see that this is equivalent to "global order".

|The strict model diagram and non-strict model diagrams describe the
|time constraint - instantaneous. It can be better understood as though
|a global clock is present in which every write should be reflected in
|all processor caches by the end of that clock period. The next
|operation must happen only in the next clock period.

That's easy to implement: Just use a slow-enough clock; I don't know
why people like to bullshit us with Einstein when they talk about
consistency models. But anyway, we probably don't want memory systems
that are that slow.

>"Sequential consistency" is the strongest consistency model that is
>physically possible;

<https://en.wikipedia.org/wiki/Consistency_model#Sequential_consistency>:
|A write to a variable does not have to be seen instantaneously,
|however, writes to variables by different processors have to be seen
|in the same order by all processors.

This sounds much more like "global order" to me.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

EricP
Jun 4, 2022, 9:37:51 AM
You are referring to multiple updates to multiple separate locations.
I am referring to updates to a single location as seen by all nodes,
which can be "multi-copy atomic" or not.

The reason I mentioned this was because given Marcus's desire to not
stall for writes, one might be tempted to consider a write-update
coherence protocol, and write-update is not multi-copy atomic and
that makes reasoning about concurrent programming much more difficult.

From ARM, which was originally not multi-copy atomic and now is:

Multi-copy atomicity:

In a multiprocessing system, writes to a memory location are
multi-copy atomic if the following conditions are both true:
- All writes to the same location are serialized,
meaning they are observed in the same order by all observers,
although some observers might not observe all of the writes.
- A read of a location does not return the value of a write
until all observers observe that write.

Writes that are not coherent are not multi-copy atomic.

--------------------------------
P1      P2      P3       P4
[x]=1   [x]=2   r1=[x]   r2=[x]

If it is possible for P3 to read r1=1 and P4 to read r2=2 then
the system is _not_ multi-copy atomic.
If it is not possible then it is multi-copy atomic.

A write-invalidate coherence protocol is multi-copy atomic
because it ensures that (a) there is only one exclusive holder
and (b) only the exclusive holder can perform updates.
This establishes a global order of updates to that location
so that either P1 wrote X before P2 or P2 wrote X before P1.

A write-update protocol broadcasts updates to all other nodes,
and since each node applies those updates at its own pace it allows
different nodes to see a single location with different values.
Write-update is not multi-copy atomic and requires memory barriers
that flush various queues to ensure all updates are applied
so that each node sees a location with the same value.
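The distinction can be made concrete with a toy Python sketch (my own model, not any real protocol; all names are invented): serializing writes at one exclusive owner gives every observer the same order, while per-node delivery of broadcast updates does not.

```python
# Toy comparison: how two observer nodes see writes to a single
# location x under write-update (independent per-node delivery
# queues) vs write-invalidate (one exclusive owner serializes).

def write_update(delivery_order_p3, delivery_order_p4):
    """Each node applies broadcast updates in its own arrival order."""
    return list(delivery_order_p3), list(delivery_order_p4)

def write_invalidate(writes):
    """A single exclusive owner serializes all writes; every
    observer sees that one global order."""
    serialized = list(writes)   # order decided once, at the owner
    return serialized, serialized

# P1 writes [x]=1, P2 writes [x]=2.
# Write-update: broadcasts can reach P3 and P4 in opposite orders.
p3, p4 = write_update([1, 2], [2, 1])
assert p3 != p4                 # observers disagree: not multi-copy atomic

# Write-invalidate: whatever order the owner picks, everyone sees it.
p3, p4 = write_invalidate([2, 1])
assert p3 == p4                 # single global order: multi-copy atomic
```

The point of the sketch is only that MCA is a property of *where* the order of writes to one location is decided: once, globally, or per observer.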


EricP

Jun 4, 2022, 12:32:56 PM
EricP wrote:
>
> Multi-copy atomicity:
>
> In a multiprocessing system, writes to a memory location are
> multi-copy atomic if the following conditions are both true:
> - All writes to the same location are serialized,
> meaning they are observed in the same order by all observers,
> although some observers might not observe all of the writes.
> - A read of a location does not return the value of a write
> until all observers observe that write.
>
> Writes that are not coherent are not multi-copy atomic.
>
> --------------------------------
> P1      P2      P3       P4
> [x]=1   [x]=2   r1=[x]   r2=[x]
>
> If it is possible for P3 to read r1=1 and P4 to read r2=2 then
> the system is _not_ multi-copy atomic.
> If it is not possible then it is multi-copy atomic.

This is not an example of what MCA does/does-not allow.
I'll try to come up with a better one.


MitchAlsup

Jun 4, 2022, 5:23:52 PM
On Saturday, June 4, 2022 at 8:37:51 AM UTC-5, EricP wrote:
> Ivan Godard wrote:
> > On 6/3/2022 9:19 AM, EricP wrote:<skip>
> >
> >
> >> This makes the loads and stores a kind of Async IO.
> >>
> >> It doesn't eliminate the need for a coherence protocol.
> >> If you have multiple writers to the same byte then you need
> >> to establish a global order to decide which write came first
> >> and which came second because only the later value is retained.
> >
> > "global order" is a technical term also known as strict consistency. It
> > is only a mathematical notion because it is not physically realizable.
> > "Sequential consistency" is the strongest consistency model that is
> > physically possible; that's what Mill provides. No common legacy chip is
> > sequentially consistent, although x86 is close.
> >
> > The problem is fundamental, a property of Einsteinian mechanics; a
> > universe in which strict consistency were possible could also have
> > wormholes and time machines. It's been speculated (sorry, no cite) that
> > global order is possible in the vicinity of a black hole event horizon,
> > but the math is beyond me.
> >
> > Wikipedia has a good introduction to consistency models in general and
> > the various kinds. The subject is subtle and counter-intuitive.
> You are referring to multiple updates to multiple separate locations.
> I am referring to updates to a single location as seen by all nodes,
> which can be "multi-copy atomic" or not.
<
What do you think is an appropriate name for a system where memory to
to memory copies up to the size of a page are ATOMIC ?? That is, all interested
parties see the memory either before the destination page gets written or they
see it after all of the page has been written.
<
And BTW, this works for pages that are both cacheable and uncacheable.

EricP

Jun 4, 2022, 8:47:52 PM
This might do:

Initially x=0,y=0

P1      P2              P3
x=1     while (x==0)    r2=y
        Pause();        r1=x
        y=2

Possible: r1=0,r2=0 r1=1,r2=0 r1=1,r2=2

Impossible with write-invalidate (MCA) (y is causally dependent on x),
Possible with write-update (y is not causally dependent on x):
r1=0,r2=2
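The claim can be checked by brute force. Here is a small Python sketch (mine, modeling one trip of P2's spin loop) that enumerates every sequentially consistent interleaving, i.e. a multi-copy-atomic memory with one global order, and confirms r1=0,r2=2 never appears:

```python
from itertools import count  # (not strictly needed; stdlib only below)

# Exhaustively interleave the three threads' memory operations under
# a single global order where every read sees the latest prior write.
#   P1: x = 1
#   P2: t = x; if t == 1: y = 2      (one trip of the spin loop)
#   P3: r2 = y; r1 = x
P1 = [("w", "x", 1)]
P2 = [("r", "x", "t"), ("cw", "y", 2)]   # cw: write only if t == 1
P3 = [("r", "y", "r2"), ("r", "x", "r1")]

def interleavings(threads):
    # All merges of the per-thread sequences, preserving program order.
    if all(not t for t in threads):
        yield []
        return
    for i, t in enumerate(threads):
        if t:
            rest = threads[:i] + [t[1:]] + threads[i + 1:]
            for tail in interleavings(rest):
                yield [t[0]] + tail

outcomes = set()
for trace in interleavings([P1, P2, P3]):
    mem = {"x": 0, "y": 0}
    regs = {}
    for op in trace:
        if op[0] == "w":
            mem[op[1]] = op[2]
        elif op[0] == "r":
            regs[op[2]] = mem[op[1]]
        elif op[0] == "cw" and regs.get("t") == 1:
            mem[op[1]] = op[2]
    outcomes.add((regs["r1"], regs["r2"]))

assert outcomes == {(0, 0), (1, 0), (1, 2)}
assert (0, 2) not in outcomes   # the write-update-only outcome
```

With one global order, y==2 implies P2 already saw x==1, so P3's later read of x cannot return 0.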



MitchAlsup

Jun 5, 2022, 10:59:30 AM
On Saturday, June 4, 2022 at 7:47:52 PM UTC-5, EricP wrote:
> EricP wrote:
> > EricP wrote:
> >>
> >> Multi-copy atomicity:
> >>
> >> In a multiprocessing system, writes to a memory location are
> >> multi-copy atomic if the following conditions are both true:
> >> - All writes to the same location are serialized,
> >> meaning they are observed in the same order by all observers,
> >> although some observers might not observe all of the writes.
> >> - A read of a location does not return the value of a write
> >> until all observers observe that write.
> >>
> >> Writes that are not coherent are not multi-copy atomic.
> >>
> >> --------------------------------
> >> P1 P2 P3 P4
> >> [x]=1 [x]=2 r1=[x] r2=[x]
> >>
> >> If it is possible for P3 to read r1=1 and P4 to read r2=2 then
> >> the system is _not_ multi-copy atomic.
> >> If it is not possible then it is multi-copy atomic.
> >
> > This is not an example of what MCA does/does-not allow.
> > I'll try to come up with a better one.
> This might do:
>
> Initially x=0,y=0
>
> P1      P2              P3
> x=1     while (x==0)    r2=y
<
I don't see how r2=y is ever executed........especially if x is shared across processes.
<
>         Pause();        r1=x
>         y=2
>
> Possible: r1=0,r2=0 r1=1,r2=0 r1=1,r2=2
<
So, we have r2 = undefined.

EricP

Jun 5, 2022, 11:28:19 AM
Initially x=0,y=0

P1      P2              P3
x=1     while (x==0)    r2=y
        Pause();        r1=x
        y=2

P2 spin-waits while x==0.
When P1 writes x=1 it releases P2 to write y=2.

If there is only one value for x in a system at a time,
as a write-invalidate protocol enforces,
then P3 should never read y==2 and later read x==0 because
old x is invalidated from the whole system before a new x is written.

However a write-update protocol could allow the new value
of y to arrive at P3 before the new value of x
so P3 might read y==2 and later read x==0 because
the old x is not invalidated, just overwritten in the fullness of time.
To stop this from happening requires memory barriers on P1,P2,P3
to ensure all outbound and inbound queues are flushed.
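A toy model of those per-channel delivery queues (purely illustrative, not any real write-update implementation) shows the causality violation:

```python
from collections import deque

# Each (writer, reader) channel is an independent queue, so updates
# from different writers can reach a third node out of causal order.
class Node:
    def __init__(self):
        self.mem = {"x": 0, "y": 0}
        self.inbound = deque()

    def deliver(self):                    # apply one queued update
        var, val = self.inbound.popleft()
        self.mem[var] = val

p3 = Node()

# P1 writes x=1; model the update on P1->P3's channel as still in flight.
p3.inbound.append(("x", 1))
update_from_p1 = p3.inbound.pop()

# P2 has already seen x=1 locally and writes y=2; P2->P3's channel is fast.
p3.inbound.append(("y", 2))
p3.deliver()                              # y=2 arrives at P3 first

r2 = p3.mem["y"]                          # reads 2 ...
r1 = p3.mem["x"]                          # ... then reads stale x==0

p3.inbound.append(update_from_p1)
p3.deliver()                              # x=1 arrives only now

assert (r1, r2) == (0, 2)  # the outcome a memory barrier must prevent
```

The barrier's job, in this picture, is to drain `inbound` (and the writers' outbound queues) before the reads are allowed to proceed.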

EricP

Jun 5, 2022, 12:05:17 PM
MitchAlsup wrote:
> <
> What do you think is an appropriate name for a system where memory to
> to memory copies up to the size of a page are ATOMIC ?? That is, all interested
> parties see the memory either before the destination page gets written or they
> see it after all of the page has been written.
> <
> And BTW, this works for pages that are both cacheable and uncacheable.

Blastatronic technology, now with page Atom-o-copy!


MitchAlsup

Jun 5, 2022, 1:34:42 PM
On Sunday, June 5, 2022 at 10:28:19 AM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Saturday, June 4, 2022 at 7:47:52 PM UTC-5, EricP wrote:
> >> This might do:
> >>
> >> Initially x=0,y=0
> >>
> >> P1 P2 P3
> >> x=1 while (x==0) r2=y
> > <
> > I don't see how r2=y is ever executed........especially is x is shared across processes.
> > <
> >> Pause(); r1=x
> >> y=2
> >>
> >> Possible: r1=0,r2=0 r1=1,r2=0 r1=1,r2=2
> > <
> > So, we have r2 = undefined.
> >> Impossible with write-invalidate (MCA) (y is causally dependent on x),
> >> Possible with write-update (y is not causally dependent on x):
> >> r1=0,r2=2
>
>
> Initially x=0,y=0
>
> P1      P2              P3
> x=1     while (x==0)    r2=y
>         Pause();        r1=x
>         y=2
>
> P2 spin-waits while x==0.
> When P1 writes x=1 it releases P2 to write y=2.
<
It looked to me that you performed the 'x=1 while (x==0) r2=y' on all 3 processors.

MitchAlsup

Jun 5, 2022, 1:35:14 PM

EricP

Jun 7, 2022, 11:15:14 AM
MitchAlsup wrote:
> On Friday, June 3, 2022 at 11:21:19 AM UTC-5, EricP wrote:
>> This is partially handled by what Mitch referred to earlier as
>> Miss Buffers and others refer to as Miss Status Holding Registers.
>> It not only allows non-blocking hit-under-miss loads,
>> it also gives the Load Store Queue a place to stash write bytes
>> when there is a write miss, and can source bytes if later read back
>> before the write-miss completes (e.g. a quick push-pop).
>>
>> Complexity-Performance Tradeoffs with Non-blocking Loads 1994
>> http://courses.cs.washington.edu/courses/csep548/00sp/handouts/lockup_free.pdf
>>
>> There is also various design options for cache resource management:
>>
>> Cache Write Policies and Performance 1993
>> https://home.gwu.edu/~mlancast/CS6461Reference/Cache/12_write.pd.pdf
>>
>> A search for "non blocking cache" gets lots of hits.
>> It also matters whether this is for ASIC or FPGA as the later
>> do not have CAMs available so need different designs.
> <
> Modern semiconductor technologies do not have CAMs either--we
> build them out of std gates.

Ok, but they are still used for things like Miss Buffer indexes.
For ASIC there just isn't a pre-packaged optimized CAM standard cell
that you can plop into your design.

In one article I read they compared the access latency of
various sizes of CAM to $L1. They found they could access a
CAM with 16 rows just slightly faster than a cache line
but a 32 row CAM took 50% longer than a cache line.

In that case it sets the maximum number of miss buffers at 16.
Victim buffers also need indexing if one wants to be able to
pull back a victim so similar size limits apply.
This is the store buffer for store-to-load forwarding and
is prior to Retire, right? I have difficulty seeing how that
can be implemented without a CAM. Maybe there could be separate
banks of smaller CAMs so one can keep the size to 16 rows max.

On some related questions about LSQ design, when the oldest store retires
if it gets a $L1 cache miss it is not clear whether it is better to
get the uOp out of the LSQ and free up a slot, or to hold the uOp
and its data in the LSQ until the miss completes and then replay
the missed store. Options I can see are:

- hold the store uOp in the LSQ while the cache miss buffer tracks
the state of the miss, then replay the store on miss completion.

- hand off the store data to the cache and stash its data bytes
in the miss buffers (so each miss buffer has a 64-byte stash line)
When the line arrives it merges with the stash bytes then
the whole line moves to its cache destination.
This delays victim eviction until new line arrives,
and stalls with a miss buffer in use if the victim buffer is full.

- allocate a destination cache line on miss and
stash the store bytes in the destination line,
while the miss buffer has a 64-bit mask to record which bytes
are already valid in the dest line, which merges when line arrives.
This evicts victim line on miss and stalls if victim buffer is full.
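The stash-and-merge bookkeeping shared by the last two options can be sketched in Python (a behavioral toy only; the 64-byte line size and all names are mine):

```python
# Per-miss stash with a byte-valid mask: store bytes written during
# the miss are kept in a stash; when the line arrives from L2,
# fetched bytes fill only the positions the stores did not write.
LINE = 64

class MissBuffer:
    def __init__(self):
        self.stash = bytearray(LINE)
        self.mask = [False] * LINE        # True = byte written by a store

    def store(self, offset, data):
        for i, b in enumerate(data):
            self.stash[offset + i] = b
            self.mask[offset + i] = True

    def load(self, offset, size):
        # Can forward only if every requested byte is already stashed.
        if all(self.mask[offset:offset + size]):
            return bytes(self.stash[offset:offset + size])
        return None                       # partial: must wait for the line

    def merge(self, fetched_line):
        # Store bytes win; fetched bytes fill the rest.
        return bytes(self.stash[i] if self.mask[i] else fetched_line[i]
                     for i in range(LINE))

mb = MissBuffer()
mb.store(8, b"\xaa\xbb")                  # write-miss bytes land in the stash
assert mb.load(8, 2) == b"\xaa\xbb"       # quick read-back forwards from stash
assert mb.load(8, 4) is None              # bytes 10..11 not valid yet
line = mb.merge(bytes([0x11] * LINE))     # line arrives from L2
assert line[8:12] == b"\xaa\xbb\x11\x11"
```

This is also essentially the per-line byte mask from the start of the thread, applied to a miss buffer instead of a resident line.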


MitchAlsup

Jun 7, 2022, 12:19:38 PM
Access 16 on the left and 16 on the right and combine them with
a single gate delay multiplexer. {But I agree that CAMs cannot be
made arbitrarily large and fast at the same time}
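Behaviorally, the halved-CAM trick looks like this Python sketch (my own toy; in hardware both halves search in parallel and the combine is one mux level, not the sequential calls modeled here):

```python
# Search two 16-row CAM halves "in parallel" and combine with one mux,
# so a 32-entry search pays roughly the latency of a 16-entry one.

def cam_search(rows, key):
    # Fully associative match: every row compares against the key at once.
    hits = [i for i, tag in enumerate(rows) if tag == key]
    return hits[0] if hits else None

def split_cam_search(left, right, key):
    l = cam_search(left, key)             # both halves searched together
    r = cam_search(right, key)
    if l is not None:
        return ("left", l)                # one mux level picks the side
    if r is not None:
        return ("right", r)
    return None

left = [0x100 + i for i in range(16)]
right = [0x200 + i for i in range(16)]
assert split_cam_search(left, right, 0x105) == ("left", 5)
assert split_cam_search(left, right, 0x203) == ("right", 3)
assert split_cam_search(left, right, 0x999) is None
```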
>
> In that case it sets the maximum number of miss buffers at 16.
> Victim buffers also need indexing if one wants to be able to
> pull back a victim so similar size limits apply.
<
How often do you need more than 16 miss buffers for a single thread?
It was a CAM--wire was significantly faster in 1991 than today compared
to gates.
>
> On some related questions about LSQ design, when the oldest store retires
> if it gets a $L1 cache miss it is not clear whether it is better to
> get the uOp out of the LSQ and free up a slot, or to hold the uOp
> and its data in the LSQ until the miss completes and then replay
> the missed store. Options I can see are:
<
It does not retire until the line is present in the cache (or CC) or when
it has been seen that a younger memref will overwrite the line in the
cache--in which case the ST is retired directly to DRAM.

EricP

Jun 7, 2022, 1:16:25 PM
I have been trying to get my head around the whole OoO LSQ and
D$L1 cache design and their mutual interaction.

These are the thoughts I had so far on its structure and sequencing...

Your 48*48 matrix sounds similar to an idea I had for load misses.
I didn't like the proposal Jouppi had for tracking one missed load in a
miss buffer and stalling if a second miss load to the same line occurs.
That sounds like it would stall too soon in an OoO uArch.

In my gedanken-design I see the LSQ as a circular buffer so once a
new entry is added at the head it remains in that slot index until
removed from the tail (as opposed to, say, a collapsing buffer design).

Each LSQ entry is like a uOp except more complex as each has multiple
states to sequence through with various Ready or Wait states, and multiple
different schedulers selecting entries waiting for some operation.
The LSQ has multiple ports for adding new entries at head,
plus ports selected by each different scheduling operation.

If an address calculation is required, it is routed through the
AGU Reservation Stations, which has its own AGU-RS scheduler and
writes its result Virtual Address into the associated LSQ entry
and sets the entry's VA_Ready flag.

When the LSQ entry has a valid virtual address and its VA_Ready flag
is set it requests translation, which the translate scheduler
(a circular priority selector) sees and reads its VA through a LSQ port
to the translate/table_walk machine. The entry waits until translate
stores the 1 (or 2 if a straddle) physical addresses
and sets the PA_Ready flag, or writes an error code.
(It it is a translate error then the entry follows a different course.)

The table walker is one or more state machines that bypasses the LSQ,
arbitrates with LSQ for the bus to talk to the cache. It can read and
write PTE's from valid cache lines, and multiple walkers might have
multiple cache misses outstanding at once. Because walkers bypass the
LSQ and talk straight to cache they don't get stuck behind stalled
LSQ entries. But it also requires an LSQ flush after writes to PTE's.

When an entry's PA_Ready flag is set, a scheduler (again a circular
priority selector) sees it and selects the oldest entry,
which is passed to PA_Disambiguation to separate access by cache line,
so that all loads and stores to the same line are performed in order,
but separate cache lines accesses can be performed in any order
(that last statement is weaker than x86-TSO coherence but
could be more restricted to all stores performed in order).

PA_Disambiguation is where your 48*48 dependency matrix comes into
play, as it decides when an entry can actually talk to cache.
When the dependency matrix indicates no older ancestors
it sets the entry's PA_Allowed flag to enable memory access.
This is where the Store Buffer and store-to-load forwarding occurs.
(I'm not sure where multiple byte store merging should take place.)

When the entry's PA_Allowed flag is set, a circular scheduler sees it
and arbitrates for the LSQ-D$L1 cache access bus. On grant it
sends its load PA-address or store PA-address and data to cache.

(This is the part that is different from how Jouppi handles cache misses.)
It also sends a one-hot bit mask with a bit set indicating which
LSQ entry sourced the operation. This is the wake-up ID if needed.

D$L1 checks the cache and Miss Buffers for the PA.
If it is a cache hit then it proceeds as normal read or write and ACK'd.

If a cache miss, the Miss Buffer CAM is checked to see if one is
already outstanding. If not, an MB is allocated (or a stall signal
is returned if none is available), and the wake-up mask is OR'd.
In this example there were 48 LSQ entries so each MB has a 48 bit mask.
The cache returns a Miss signal to LSQ which changes the state
of the source LSQ entry from PA_Allowed to PA_Allowed_Wait,
which removes it temporarily from the schedule.

When the missed line returns from $L2 it checks the MB CAM and hits.
It merges the new line with any stashed store bytes,
and moves the merged line to the D$ banks.

The wake-up mask is sent from D$L1 to LSQ over the 48-bit wake-up bus,
which changes the state of ALL those LSQ entries back to PA_Allowed
in 1 clock (this eliminates any sequential sequencing from wake-up).
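The mask handshake can be sketched behaviorally in Python (sizes follow the example: 48 LSQ entries, one-hot source IDs, a per-miss-buffer mask OR'd on each secondary miss and broadcast on fill; all names are mine):

```python
LSQ_ENTRIES = 48

class MissBufferEntry:
    def __init__(self, line_addr):
        self.line_addr = line_addr
        self.wake_mask = 0                # one bit per LSQ entry to wake

miss_buffers = {}                         # line_addr -> MissBufferEntry
lsq_state = ["PA_Allowed"] * LSQ_ENTRIES

def access(line_addr, lsq_id, hit):
    if hit:
        return "ack"
    mb = miss_buffers.setdefault(line_addr, MissBufferEntry(line_addr))
    mb.wake_mask |= 1 << lsq_id           # OR in this entry's one-hot bit
    lsq_state[lsq_id] = "PA_Allowed_Wait" # off the schedule until fill
    return "miss"

def fill(line_addr):
    mb = miss_buffers.pop(line_addr)
    for i in range(LSQ_ENTRIES):          # all waiters wake in one step
        if mb.wake_mask >> i & 1:
            lsq_state[i] = "PA_Allowed"

# Entries 3 and 7 both miss on the same line; one MB tracks both.
access(0xABC0, 3, hit=False)
access(0xABC0, 7, hit=False)
assert lsq_state[3] == lsq_state[7] == "PA_Allowed_Wait"
assert len(miss_buffers) == 1

fill(0xABC0)                              # line returns from L2
assert lsq_state[3] == lsq_state[7] == "PA_Allowed"
```

Because the fill broadcasts the whole mask at once, no waiter wakes another in sequence, which is the point of the 1-clock wake-up.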

The LSQ-D$_cache scheduler sees those entries again and replays them;
this time each gets a load hit, D$L1 returns the data to the LSQ entry,
which stores the data in the entry and changes its state to Data_Valid.
The write-back scheduler selects a Data_Valid entry
and arbitrates for a data write-back/forwarding bus.

When the load data has been sent, the LSQ entry is marked Complete.
Later an LSQ entry Retire unit moves up the circular buffer tail,
freeing the LSQ entry for reuse.

Whew...
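The per-entry state sequence walked through above boils down to a small transition table (state names as in the post; the happy path plus the miss/replay loop; a behavioral sketch only):

```python
# Allowed transitions for one LSQ entry. Events are the wake conditions
# the various schedulers act on.
TRANSITIONS = {
    "Allocated":       {"agen_done": "VA_Ready"},
    "VA_Ready":        {"translated": "PA_Ready"},
    "PA_Ready":        {"no_older_deps": "PA_Allowed"},
    "PA_Allowed":      {"cache_hit": "Data_Valid",
                        "cache_miss": "PA_Allowed_Wait"},
    "PA_Allowed_Wait": {"wake_up": "PA_Allowed"},   # replay after fill
    "Data_Valid":      {"data_sent": "Complete"},
}

def run(events, state="Allocated"):
    for ev in events:
        state = TRANSITIONS[state][ev]   # KeyError = illegal transition
    return state

# A load that misses once, is woken by the fill, replays to a hit:
final = run(["agen_done", "translated", "no_older_deps",
             "cache_miss", "wake_up", "cache_hit", "data_sent"])
assert final == "Complete"
```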


MitchAlsup

Jun 7, 2022, 6:15:26 PM
Collapsing buffers have major problems when instructions need to be replayed.
>
> Each LSQ entry is like a uOp except more complex as each has multiple
> states to sequence through with various Ready or Wait states, and multiple
> different schedulers selecting entries waiting for some operation.
<
In the case of Conditional Cache: the uOp can be fired to access lower in
the cache hierarchy, fired to send a now "in-order" LD to completion,
fired to send a ST with data to {D$ or lower in the hierarchy}, fired to
access the D$, and a bunch more.
<
> The LSQ has multiple ports for adding new entries at head,
> plus ports selected by each different scheduling operation.
<
I add them issue-width at a time coordinated with branch checkpoints
to make backup easier.
>
> If an address calculation is required, it is routed through the
> AGU Reservation Stations, which has its own AGU-RS scheduler and
> writes its result Virtual Address into the associated LSQ entry
> and sets the entry's VA_Ready flag.
<
Note: A memref can AGEN multiple times under replay.
>
> When the LSQ entry has a valid virtual address and its VA_Ready flag
> is set it requests translation, which the translate scheduler
> (a circular priority selector) sees and reads its VA through a LSQ port
> to the translate/table_walk machine. The entry waits until translate
> stores the 1 (or 2 if a straddle) physical addresses
> and sets the PA_Ready flag, or writes an error code.
> (If it is a translate error then the entry follows a different course.)
<
When an entry has a VA but not a PA it can neither complete nor access lower
in the hierarchy. Depending on the characteristics of the D$ it may or may
not access the D$.
>
> The table walker is one or more state machines that bypasses the LSQ,
> arbitrates with LSQ for the bus to talk to the cache. It can read and
> write PTE's from valid cache lines, and multiple walkers might have
> multiple cache misses outstanding at once. Because walkers bypass the
> LSQ and talk straight to cache they don't get stuck behind stalled
> LSQ entries. But it also requires an LSQ flush after writes to PTE's.
<
I always bypassed the L1 caches when tablewalking, typically accessing L2.
<
In Mc 88120 we did an experiment on multiple simultaneous tablewalks
and found that complexity wanting. 64-bit VA and PA may have changed
that situation.
>
> When an entry's PA_Ready flag is set, a scheduler (again a circular
> priority selector) sees it and selects the oldest entry,
<
Oldest entry that has no pending dependencies.
<
> which is passed to PA_Disambiguation to separate access by cache line,
> so that all loads and stores to the same line are performed in order,
<
Mc 88120 could perform 3 LDs and 3 STs to the same cache line
simultaneously, independent of the LD-ST orders. No data got lost
and no LD got data younger than it was supposed to see. This was
part of what MDM provided. So, while only 3 AGENs could be performed
several other MemRefs could be fired out of CC while AGENing was
going on.
<
> but separate cache lines accesses can be performed in any order
> (that last statement is weaker than x86-TSO coherence but
> could be more restricted to all stores performed in order).
<
Mc 88120 was restricted by its 4-bank D$, and direct mapping used
VA address bits (16KB cache, 4KB pages); the CC was not restricted.
Also note, Packet$ misses were satisfied through D$ (making it
a U$)
>
> PA_Disambiguation is where your 48*48 dependency matrix comes into
> play, as it decides when an entry can actually talk to cache.
<
Access D$ as quickly as possible. Decide if LD can deliver its
result later based on MDM. {Remember LDs can fire from CC
as dependencies resolve.} When SD dependencies resolve
STs can write to D$ or lower levels of hierarchy.
<
> When the dependency matrix indicates no older ancestors
> it sets the entry's PA_Allowed flag to enable memory access.
<
We never "set" anything, but when dependencies resolved MDM did assert
essentially the same signal as what you implied we set. The wording is
different because replay (K9) could reAGEN and we had to go back to
original dependencies. '120 backed up on this case (0 cycles) but K9
did not have instantaneous backup like '120.
<
> This is where the Store Buffer and store-to-load forwarding occurs.
> (I'm not sure where multiple byte store merging should take place.)
<
As a LD accesses CC, each address matching ST that has written data
asserts, and we use an age mask and find-firsts to sort out which data is
accumulated for the LD.
>
> When the entry's PA_Allowed flag is set, a circular scheduler sees it
> and arbitrates for the LSQ-D$L1 cache access bus. On grant it
> sends its load PA-address or store PA-address and data to cache.
<
We read D$ (when no bank conflicts) and put a line into CC (128 bits) in
'120. In K9 we read sub-lines into a similar mechanism. So each CC
had a complete set of data (although it had byte masks to sort out
various difficulties). We just had to decide which entry was to supply
which piece of data. '120 copied forward after AGEN, K9 did not. Both
sets of logic were of similar complexity.
>
> (This is the part that is different from how Jouppi handles cache misses.)
> It also sends a one-hot bit mask with a bit set indicating which
> LSQ entry sourced the operation. This is the wake-up ID if needed.
<
'120 wakeup ID was the relaxed MDM on an entry and the entry being
in any undelivered but completed state; and of course, that checkpoint
having passed the "consistent point".
>
> D$L1 checks the cache and Miss Buffers for the PA.
> If it is a cache hit then it proceeds as normal read or write and ACK'd.
<
Misses were played out "from" the CC. Data was returned with address
and if anyone in the CC wanted the data, it would be snarfed in. This
enabled an external party to throw data at the CPU and if timing was
acceptable, the access would see 0 added latency. Bell Northern was
going to use this to feed instructions to the CPU before the CPU got
around to wanting them.
>
> If a cache miss then the Miss Buffer CAM is checked to see if
> already outstanding. If not a MB is allocated (or returns a
> stall signal if none available), and the wake-up mask is OR'd.
> In this example there were 48 LSQ entries so each MB has a 48 bit mask.
> The cache returns a Miss signal to LSQ which changes the state
> of the source LSQ entry from PA_Allowed to PA_Allowed_Wait,
> which removes it temporarily from the schedule.
<
'120 and K9 had no miss buffer (other than CC).
>
> When the missed line returns from $L2 it checks the MB CAM and hits.
> It merges the new line with any stashed store bytes,
> and moves the merged line to the D$ banks.
<
Don't write D$ while data is in CC; wait until the instruction retires.....
>
> The wake-up mask is send from D$L1 to LSQ over the 48-bit wake-up bus,
> which changes the state of ALL those LSQ entries back to PA_Allowed
> in 1 clock (this eliminates any sequential sequencing from wake-up).
<
'120 and K9 had the MDM attached to the CC in such a way that this bus
was no more than a set of wires no longer than about 6 gates of length.
{You can't really call it a bus at that point--just 48 short wires.}
>
> The LSQ-D$_cache scheduler sees those entries again and replays it,
> this time gets a load hit, D$L1 returns the data to the LSQ entry,
> which stores the data in the entry and changes its state to Data_Valid.
> The write-back scheduler selects a Data_Valid entry
> and arbitrates for a data write-back/forwarding bus.
>
> When the load data has been sent, the LSQ entry is marked Complete.
> Later an LSQ entry Retire unit moves up the circular buffer tail
> freeing the LSQ entry for reuse.
>
> Whew...
<
Whew, indeed. The CC was one of my more interesting "adventures" in
computer architecture. There was the VA CAM, the PA CAM, the DATA
store, the MDM, and several schedulers (circular FF1s.) The way it was
designed resulted in a circuit that was only 6-wires wide !!!