On Friday, October 26, 2018 at 1:17:16 PM UTC-4, MitchAlsup wrote:
[snip]
> When one increases the size of the cache but not the size of the page
> one ends up having more sets of associativity in the cache to avoid an
> aliasing problem. Thus, as the 1st level cache gets bigger, it gets
> more power intensive.
>
> In the case of Athlon/Opteron, the 64KB caches were 4-way banked and
> 4-way setted in order to support 2 accesses per cycle, and the TLB
> was organized along with the cache tags so that both the PTE data and
> cache tag data arrived at an analog XOR gate with the sense amps reading
> if both agreed with each other.

ISTR that Athlon's 64 KiB L1 caches were only two-way
associative (and used a duplicate tag set to probe aliasable
ways on a cache miss). Probing alternate ways on a cache miss
(with optional filtering and other throughput/energy
optimizations) is one way to avoid the cache size limit of
page size times associativity. A tag-inclusive L2 cache
(which can also be helpful in filtering coherence traffic)
that indicates the set (or even the specific index) would
likewise avoid the associativity constraint.
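
To put numbers on that limit (the figures below are
illustrative assumptions, not a description of any particular
core): for a virtually indexed, physically tagged cache to be
alias-free, the index bits must fall within the page offset,
i.e., cache size <= associativity * page size. A toy C
calculation:

    #include <stdio.h>

    int main(void)
    {
        unsigned page_size  = 4096;      /* 4 KiB pages (assumed)       */
        unsigned line_size  = 64;        /* bytes per block (assumed)   */
        unsigned cache_size = 64 * 1024; /* 64 KiB L1                   */
        unsigned assoc      = 2;         /* two ways, as recalled above */

        unsigned sets       = cache_size / (line_size * assoc);
        unsigned index_span = sets * line_size; /* bytes the index covers */

        /* Alias-free VIPT needs the index inside the page offset,
           i.e. cache_size <= assoc * page_size.                     */
        printf("index spans %u bytes, page is %u bytes\n",
               index_span, page_size);
        printf("alias-free without extra mechanisms: %s\n",
               index_span <= page_size ? "yes" : "no");
        printf("ways needed to be alias-free: %u\n",
               cache_size / page_size);
        return 0;
    }

With 4 KiB pages a 64 KiB cache would need 16 ways to stay
within the limit, which is why alias probing, duplicate tags,
or L2-assisted set indication become attractive.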

Way prediction can reduce the access energy cost of higher
associativity. Predicting the least significant bits of the
physical address seems a closely related technique for
avoiding the associativity constraint (one could view way
group allocation as being determined by those bits, so the
prediction is a way-group prediction).
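
A rough sketch of what I have in mind (entirely a toy of my
own, not any shipped predictor): predict a couple of physical
address bits from a virtual-address hash and only read the
data ways in the predicted group.

    #include <stdint.h>
    #include <stdio.h>

    #define WAYS            8
    #define GROUP_BITS      2              /* predict 2 physical bits */
    #define GROUPS          (1u << GROUP_BITS)
    #define WAYS_PER_GROUP  (WAYS / GROUPS)
    #define PRED_ENTRIES    1024           /* assumed predictor size  */

    static uint8_t group_pred[PRED_ENTRIES];

    /* Pick a way group from a hash of the virtual address before
       the TLB has produced the physical address; only
       WAYS_PER_GROUP data ways are then read, saving energy.      */
    static unsigned predict_group(uint32_t vaddr_hash)
    {
        return group_pred[vaddr_hash % PRED_ENTRIES] & (GROUPS - 1);
    }

    /* After translation, train with the actual physical bits; a
       misprediction costs a replay that probes the remaining
       groups (or all ways).                                       */
    static void train_group(uint32_t vaddr_hash, uint32_t phys_bits)
    {
        group_pred[vaddr_hash % PRED_ENTRIES] = phys_bits & (GROUPS - 1);
    }

    int main(void)
    {
        uint32_t hash = 0x1234u;
        train_group(hash, 0x3u);
        printf("predicted group: %u of %u\n",
               predict_group(hash), GROUPS);
        return 0;
    }
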
Other techniques for handling aliases (aside from just not
allowing them) have been proposed.

What I find disappointing is that neither clustering nor
sub-block NUCA has been adopted to reduce latency. Clustering
has the obvious issues of cross-cluster communication (where
the load-to-use-for-load latency may include the latency of
accessing a different cluster's cache) and layout
implications, but it would sometimes provide lower latency.
Sub-block NUCA would also allow lower load-to-use latency,
whether the fast sub-block holds the least significant bits
of every word (enough to provide indexing and perhaps even to
help way prediction, effectively width-pipelining the cache
access) or holds whole words, selected possibly dynamically.
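
A toy latency model of why the fast-low-bits variant would
help pointer chasing (the cycle counts are invented, purely
for illustration):

    #include <stdio.h>

    /* Illustrative load-to-use latencies, in cycles (assumed). */
    enum { FAST_HALF_LAT = 2, FULL_WORD_LAT = 3 };

    /* Cycles for a chain of n dependent loads when the index for
       the next load comes from the fast sub-block vs. the full
       word.                                                      */
    static unsigned chain_cycles(unsigned n, unsigned load_to_use)
    {
        return n * load_to_use;
    }

    int main(void)
    {
        unsigned hops = 10;
        printf("full-word indexing: %u cycles\n",
               chain_cycles(hops, FULL_WORD_LAT));
        printf("fast-half indexing: %u cycles\n",
               chain_cycles(hops, FAST_HALF_LAT));
        return 0;
    }

One cycle saved per hop adds up quickly in pointer-chasing
code.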

Qualcomm's Falkor *might* be considered to have a block-based
NUCA L1 instruction cache if one considers the L0 instruction
cache as the fast portion of L1. (L0 and L1 are described as
exclusive, but the allocation and transfer policy does not
appear to be described publicly. If blocks are sometimes
allocated to L1 (not always promoted to L0) and blocks evicted
from L0 may bypass L1, then such would be rather NUCA-ish.)

(Instruction caches offer interesting width-pipelining
options. For instruction-address-relative and
absolute-immediate branches, a BTB could be viewed as a form
of width pipelining (with typically redundant storage). With
16-byte-aligned fetches, a cache might be designed with the
first 8-byte chunk in a closer portion and the second in a
more distant portion, allowing some of the instruction data
to have lower latency. Predecoding could bias placement of
critical instruction information into the faster half, but
even without such biasing this kind of NUCA could be similar
to a branch target instruction cache. A trace cache with
traces two fetch-widths long (like the Pentium 4's) might be
able to achieve a two-cycle redirect latency with nearly the
capacity of a three-cycle trace cache by fetching the first
chunk from a half-capacity fast region.)
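
A toy model of the split fetch (the latencies are assumptions
of mine, not a description of a real design): the first
8-byte chunk comes from a half-capacity fast region and the
second from a slower region, so decode can start on chunk 0 a
cycle earlier.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    enum { NEAR_LAT = 1, FAR_LAT = 2 };   /* illustrative cycles */

    struct fetch16 {
        uint8_t  bytes[16];
        unsigned chunk_ready[2];          /* arrival cycle per chunk */
    };

    /* Fetch a 16-byte-aligned block whose first 8 bytes live in
       the near (fast) region and whose second 8 bytes live in the
       far region.                                                  */
    static struct fetch16 fetch(const uint8_t *line, unsigned now)
    {
        struct fetch16 r;
        memcpy(r.bytes, line, 16);
        r.chunk_ready[0] = now + NEAR_LAT;
        r.chunk_ready[1] = now + FAR_LAT;
        return r;
    }

    int main(void)
    {
        uint8_t line[16] = {0};
        struct fetch16 r = fetch(line, 0);
        printf("chunk 0 at cycle %u, chunk 1 at cycle %u\n",
               r.chunk_ready[0], r.chunk_ready[1]);
        return 0;
    }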

(Width pipelining methods could also be applied to memory
accesses, allowing the first beat of a DRAM access to be used
early while having a wide interface and moderate extra
tracking complexity. Laying out data in memory according to
expected criticality could reduce latency (at the cost of a
little extra metadata).)
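
A crude software-side analogue of the layout-by-criticality
idea (my illustration only, not the hardware mechanism): put
the fields expected to be critical at the start of a
structure so they ride in the first beat returned.

    #include <stddef.h>
    #include <stdio.h>

    /* With, say, 16-byte beats (assumed), keeping the hot fields
       in the first 16 bytes lets them be used as soon as the
       first beat arrives, before the rest of the line shows up.  */
    struct node {
        struct node *next;        /* critical: starts the next access */
        unsigned     key;         /* critical: needed for the compare */
        char         payload[48]; /* non-critical bulk data           */
    };

    int main(void)
    {
        printf("next at %zu, key at %zu, payload at %zu\n",
               offsetof(struct node, next),
               offsetof(struct node, key),
               offsetof(struct node, payload));
        return 0;
    }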