The design of the control unit is one such factor.
Another thing is the cache. Suppose that, for the level-2 cache at
least, I want to get really fancy. How is it done?
With 2-way associativity, it's simple enough - you use one cache line,
and you mark it as "most recent" and the other as "least recent".
But with a fully-associative cache, what is one supposed to do? Fully
timestamp every memory access? Setting a counter on the cache line
used to zero, and incrementing all the others, could well lead to
counters overflowing on cache lines that weren't used for a while -
since the need to discard something to make room for something new
might only come along infrequently.
Well, in the reference I was reading, I didn't see an implementation
technique, but a little thought allowed me to realize one way this
could be done.
For every group of 8 cache lines in an 8-way associative cache, one
has three-bit counters with their own circuitry, operating in
parallel...
and what happens when memory is referenced is this:
The counter of the cache line accessed is set to zero - and its *old
value* is broadcast to the counters of the other cache lines. Those
counters increment only if their value is *less* than that value. (We
will see below why it should actually be "less than or equal to".)
That allows a true LRU policy to be maintained without fancy
timestamps or enormous counters.
So if the counters contain, say,
7 5 0 3 1 6 4 2
and one keeps having multiple repeated accesses to addresses ending in
"2" and "4", corresponding to the counters at "0" or "1", a hit to the
counter at 0 changes nothing, and a hit to the counter at 1 sets it to
0, incrementing _only_ the counter at 0.
If, instead, the next access was to an address ending in "7", the
counters at 0 and 1 would be the only ones incremented, so the new
state would be:
7 5 1 3 2 6 4 0
So all the counters stay different if they're all different to begin
with. If they're all the same to start with, all at zero, though, then
one change which is not relevant to the all different case is required
- increment all the counters less than *or equal* to the former value
of the counter set to zero.
Then one hit would change
0 0 0 0 0 0 0 0
to
1 1 1 1 1 0 1 1
and a second hit would result in, say
2 2 0 2 2 1 2 2
so that the counters would accurately show which lines were least
recently used. (However, it may be better to start with a random
permutation of the numbers from 0 to 7 anyways, so that the least
recently used one always is indicated by the counter being equal to
7.)
John Savard
Are you implying that one bit per cache line would be used? It is
only necessary to use one bit per pair of cache lines for 2-way LRU.
The bit indicates which line is LRU.
> But with a fully-associative cache, what is one supposed to do? Fully
> timestamp every memory access? Setting a counter on the cache line
> used to zero, and incrementing all the others, could well lead to
> counters overflowing on cache lines that weren't used for a while -
> since the need to discard something to make room for something new
> might only come along infrequently.
An alternative to using a timer-based timestamp that has been
suggested
for use with skewed associative caches is the use of a miss counter.
(In a non-fully associative cache, less significant bits of the
timestamp
can be removed with limited penalty under the assumption that misses
will
be somewhat evenly distributed.)
I receive the impression that the storage overhead of even partial
timestamps is considered excessive. (Tag checks for a large fully
associative cache would also be problematic.)
[snip]
> For every group of 8 cache lines in an 8-way associative cache, one
> has three-bit counters with their own circuitry, operating in
> parallel...
8-way associative caches usually use binary-tree pseudoLRU. Seven
bits per set are used; each bit indicates if the MRU access was in
the left or right half of its group of cache lines.
True LRU only has P(8,8) states, 40,320 states (correct?) can be
held in six bits. Some logic could then translate the six bit value
to a replacement choice in the case of a miss and a new six bit
value or a new six bit value based on the additional three bits
representing the cache line accessed on a hit. (I would not be
surprised if one could simplify the logic by a modest increase in
storage.)
(Using the method proposed by the original poster would require 24
bits of storage per set, albeit such would presumably have a much
simpler logic for handling transitions.)
Since LRU is _only_ a heuristic, exactness is usually not considered
worth great effort.
The NRU method used by one of the Itanium L3 implementations used
one bit per cache line. It set the bit on access and if all bits
were set, then all bits were cleared. The victim was selected by
finding the first unset bit. (This might be interesting when
applied to a skewed associative cache in conjunction with tree-based
pLRU in a 4-way bundle. E.g., eight cache line bundles could be
grouped together with eight bits indicated recently-used. It might
even be appropriate to use bits of the address to select at which
bit [and perhaps from which direction] to begin the find-first-unset
bit, allowing a more randomized replacement than always searching
from the left end to the right. Presumably an access that resets
the bitfield would set the bit corresponding to the accessed cache
line bundle. [I receive the impression that the Itanium cache only
cleared the bitfield, so there was presumably some bias in
replacement?] Even in a conventional indexing cache, using an NRU
bitfield to select a group managed by LRU [for two entries] or pLRU
might be worth considering.)
Yes. Unfortunately, this approximation is much harder to analyse for
worst-case execution time than true LRU. So if you design a CPU for
hard-real-time systems, true LRU is a definite plus. IIRC as far as
the analysis is concerned, such an 8-way pseudoLRU cache gives the
same worst-case result as a 2-way LRU cache with a quarter of the size
(i.e., 6 of your 8 ways don't help in this context).
- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
> 8-way associative caches usually use binary-tree pseudoLRU. Seven
> bits per set are used; each bit indicates if the MRU access was in
> the left or right half of its group of cache lines.
Ah. This is good to know.
> True LRU only has P(8,8) states, 40,320 states (correct?) can be
> held in six bits. Some logic could then translate the six bit value
> to a replacement choice in the case of a miss and a new six bit
> value or a new six bit value based on the additional three bits
> representing the cache line accessed on a hit. (I would not be
> surprised if one could simplify the logic by a modest increase in
> storage.)
Well, I was just happy that I worked out a way to do LRU that seemed
reasonably fast, keeping things down to a modest number of gate delays
(perhaps still too many) which was simple enough for me to understand.
I was not prepared to think of going *there*!
John Savard
I must be misreading you here -- six bits can only hold 64 states. What
did you mean to say?
Aargh! It was just a brain malfunction! Obviously sixTEEN bits are
needed. (This is still a little less than the 24 bits of the
proposed
method, but perhaps not enough to justify the more complex state
update logic.)
Thank you for catching the error!
16 bits, and the victim id is not stored in a convenient manner.
To do it this way you'd need a 40320 rows*19 bit ROM,
the output being a 3 bit victim id and a 16 bit next state.
Maybe with clever encoding you could arrange it so the
lower 3 bits of the next state were the victim id so
you could get it down to "only" 40320*16 = 645120 bits,
plus a 16 => 64k decoder.
For 4 ways it wouldn't be bad though. 4! = 24 rows of 5 bits.
Eric
That since 8! is 40320 you would need at least 16 bits to hold all that
info?
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
The mechanism I came up with for True LRU used shift registers.
For 8 ways it needs 3 bits for the way number, 1 valid bit,
by 8 entries (numbered 1 to 8).
Each entry has a 3 XORs for matching ids and some glue logic.
New Way Ids enter on the left, and shift to the right.
So the MRU id is in position 1, and LRU in position 8.
E.G.
Pos 1 2 3 4 5 6 7 8
WayId => 3 6 2 4 1 7 0 5
When a Way is touched we shift its id in on the left,
and selectively shift all entries towards the right
*that are younger (left of) the matching value*.
So if Way 1 is touched we match it in Pos 5, shift right all
entries younger than Pos 5 and shift Way 1 in on the left.
Pos 6, 7, 8 do not shift.
Pos 1 2 3 4 5 6 7 8
WayId => 1 3 6 2 4 7 0 5
If we need a victim, it is in Pos 8 (Way 5 in this example).
This requires (3+1)*8 master-slave latch + 3*8 XOR plus
8 3-input AND plus sundry glue, or about 32*16+24*4+8*6 =
about 656 transisitors for each cache line set,
works in a single clock, and scales at (log_2(N)+1)*N.
Eric
> > I must be misreading you here -- six bits can only hold 64 states. What
> > did you mean to say?
>
> That since 8! is 40320 you would need at least 16 bits to hold all that
> info?
No, that's what _he_ meant to say. 16 bits instead of 6 bits is the
answer to his question.
John Savard
Hmmm... this is wrong.
We have a current state code and when we hit on a Way number
we use the current state plus the hit way to map to the new state.
So it needs 16 bits for the 40320 current state codes,
plus 3 bits for the Way number that hits,
and that maps to a new 16 bits state code
(which we arrange so that the victim id is in the lower 3 bits).
So it needs a 40320*8*16= 5,160,960 bit ROM,
plus a 19 => 524,288 decoder.
> For 4 ways it wouldn't be bad though. 4! = 24 rows of 5 bits.
For 4 Ways it needs 24*4 rows of 5 bits = 480 bits.
Eric
There are many schemes that I call pseudo-LRU, not just tree-LRU.
E.g. the hardware equivalent of the OS clock algorithm.
I thought the clock algorithm (if I understand correctly, this
is what one of the Itanium's L3 cache used) was called NRU
(Not Recently Used). At least I _think_ the paper mentioning
the Itanium implementation used 'NRU'.
It seems obvious that using logic to 'calculate' the victim
and new state would have less overhead.
For a 4-way LRU, using two bits to identify the MRU way
number, two bits to identify the second MRU way number and
one bit to identify which of the remaining ways was next
most recently used. Using direct encodings of the two MRU
ways could simplify update (which would probably be
preferred over simplifying victim selection since hits are
more likely than misses and victim selection is probably
less latency sensitive), a second MRU hit could simply
swap the MRU entries (with an MRU hit obviously leaving
the state unchanged), so more than 50% of the time
(assuming an access pattern friendly to LRU replacement)
the state update would be 'trivial'.
Along similar lines, for an eight-way LRU one could divide
the 16 bits thus: 3 bits MRU, 3 bits second MRU, 5 bits
for the third (six states) and fourth (five states) entries,
2 bits for the fifth (four states) entry, 2 bits for the
sixth (three states) entry, and one bit for the seventh
entry. This would allow the data for all but two of the
entries to be extracted by bit extractions and generating
the quotient and remainder of a five bit number divided by
six is not especially difficult (one bit of the remainder
is a bit extraction, so it could be available early).
Obviously reencoding from invalidations would be more
involved (assuming invalidated blocks become LRU). (One
alternative would be to select a victim with a simple
left-to-right search for an invalid block and for update
treat the victim as a hit when an invalid block is the
victim.)
Ok, having the MRU directly accessible could be of value for,
say, Way prediction (for lower power),
and granted when it misses you don't need the LRU Way# immediately.
Your approach does eliminate 1/2 the possible states,
but the state space is still the same size.
When there is a hit, you compare the hit Way # to MRU and 2RU.
If it is not is either of those positions then
you would still have 24 current states (5 bits) + 2 bits for hit#
or 96 potential next states.
Because you tested MRU and 2RU and know that the Hit# does
not match either of then, that eliminates 48 states.
However that leaves 48 states sparsely sprinkled about
a 128 entry lookup table.
So you need a 7 bit => 48 decoder to generate the next state number.
And you need a separate lookup table to map the
5 bit, 24 current state to an LRU Way#.
> Along similar lines, for an eight-way LRU one could divide
> the 16 bits thus: 3 bits MRU, 3 bits second MRU, 5 bits
> for the third (six states) and fourth (five states) entries,
> 2 bits for the fifth (four states) entry, 2 bits for the
> sixth (three states) entry, and one bit for the seventh
> entry. This would allow the data for all but two of the
> entries to be extracted by bit extractions and generating
> the quotient and remainder of a five bit number divided by
> six is not especially difficult (one bit of the remainder
> is a bit extraction, so it could be available early).
Ok but you still have a state space of 16 bits + 3 bits for the Hit#,
divided by 2 because you explicitly picked off MRU and 2RU,
containing 40320 * 8 / 2 = 161280 next states.
So you still need a 19 => 161280 decoder to look up the next state
if it is not in the MRU or 2RU positions.
> Obviously reencoding from invalidations would be more
> involved (assuming invalidated blocks become LRU). (One
> alternative would be to select a victim with a simple
> left-to-right search for an invalid block and for update
> treat the victim as a hit when an invalid block is the
> victim.)
Yes, all this ignores invalidations, such as for coherency.
Invalidates double the state transitions because for them
you want to move the invalid Way# into the LRU position.
For example, if our MRU..LRU order is 0 1 2 3 and 2 gets invalidated,
we move it to the LRU position 0 1 3 2.
Later if 0 gets invalidated we get 1 3 2 0.
You shift the Way# in from the left for hits,
shift in from the right for invalidates.
Eric
Oops, minor boo boo.
In the case of 4 way, matching MRU and 2RU eliminates 2 of 4
possible state transitions.
But for 8 way, it only eliminates 2 of 8 transitions, so that
should be 40320 * 8 * 3/4 = 241920 entries in a next state table.
Plus a separate 40320 entry table for looking up LRU Way#.
Eric
Two comments:
1) True has already been done on associativities greater than 8 - in
real (aka, not a toy) products
2) There's an easier way to do it