True LRU With 8-Way Associativity Is Implementable

Quadibloc

Jul 19, 2011, 11:38:02 PM

In my thoughts about a new computer design, and in the section of my
site that illustrates some implementation information, such as how a
Wallace Tree multiplier works, I've glossed over many other important
details of a computer.

The design of the control unit is one such factor.

Another thing is the cache. Suppose that, for the level-2 cache at
least, I want to get really fancy. How is it done?

With 2-way associativity, it's simple enough - you use one cache line,
and you mark it as "most recent" and the other as "least recent".

But with a fully-associative cache, what is one supposed to do? Fully
timestamp every memory access? Setting a counter on the cache line
used to zero, and incrementing all the others, could well lead to
counters overflowing on cache lines that weren't used for a while -
since the need to discard something to make room for something new
might only come along infrequently.

Well, in the reference I was reading, I didn't see an implementation
technique, but a little thought allowed me to realize one way this
could be done.

For every group of 8 cache lines in an 8-way associative cache, one
has three-bit counters with their own circuitry, operating in
parallel...

and what happens when memory is referenced is this:

The counter of the cache line accessed is set to zero - and its *old
value* is broadcast to the counters of the other cache lines. Those
counters increment only if their value is *less* than that value. (We
will see below why it should actually be "less than or equal to".)

That allows a true LRU policy to be maintained without fancy
timestamps or enormous counters.

So if the counters contain, say,

7 5 0 3 1 6 4 2

and one keeps having multiple repeated accesses to addresses ending in
"2" and "4", corresponding to the counters at "0" or "1", a hit to the
counter at 0 changes nothing, and a hit to the counter at 1 sets it to
0, incrementing _only_ the counter at 0.

If, instead, the next access was to an address ending in "7", the
counters at 0 and 1 would be the only ones incremented, so the new
state would be:

7 5 1 3 2 6 4 0

So all the counters stay different if they're all different to begin
with. If they're all the same to start with, all at zero, though, then
one change which is not relevant to the all different case is required
- increment all the counters less than *or equal* to the former value
of the counter set to zero.

Then one hit would change

0 0 0 0 0 0 0 0

to

1 1 1 1 1 0 1 1

and a second hit would result in, say

2 2 0 2 2 1 2 2

so that the counters would accurately show which lines were least
recently used. (However, it may be better to start with a random
permutation of the numbers from 0 to 7 anyways, so that the least
recently used one always is indicated by the counter being equal to
7.)
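
A minimal Python sketch of the counter update described above may make the two worked examples easier to check. The names `counter_touch` and `counter_victim` are illustrative, not from any real design, and the 3-bit width of the counters is assumed rather than enforced.

```python
def counter_touch(counters, way, inclusive=True):
    """Return the counters after an access to `way`.

    The accessed counter is reset to zero. Every other counter below the
    accessed counter's old value (or equal to it, when `inclusive` is
    true) is incremented. No 3-bit wraparound is modelled here.
    """
    old = counters[way]
    updated = []
    for i, c in enumerate(counters):
        if i == way:
            updated.append(0)
        elif c < old or (inclusive and c == old):
            updated.append(c + 1)
        else:
            updated.append(c)
    return updated


def counter_victim(counters):
    """Pick the way with the largest counter (least recently used)."""
    return max(range(len(counters)), key=lambda i: counters[i])


state = [7, 5, 0, 3, 1, 6, 4, 2]
print(counter_touch(state, 7))      # [7, 5, 1, 3, 2, 6, 4, 0]
print(counter_touch([0] * 8, 5))    # [1, 1, 1, 1, 1, 0, 1, 1]
```

With distinct starting values the `inclusive` flag makes no difference, since no other counter can equal the old value; it only matters for the all-zero start.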

John Savard

Paul A. Clayton

Jul 20, 2011, 1:19:24 AM

On Jul 19, 11:38 pm, Quadibloc <jsav...@ecn.ab.ca> wrote:
[snip]

> With 2-way associativity, it's simple enough - you use one cache line,
> and you mark it as "most recent" and the other as "least recent".

Are you implying that one bit per cache line would be used? It is
only necessary to use one bit per pair of cache lines for 2-way LRU.
The bit indicates which line is LRU.

> But with a fully-associative cache, what is one supposed to do? Fully
> timestamp every memory access? Setting a counter on the cache line
> used to zero, and incrementing all the others, could well lead to
> counters overflowing on cache lines that weren't used for a while -
> since the need to discard something to make room for something new
> might only come along infrequently.

An alternative to using a timer-based timestamp that has been
suggested for use with skewed associative caches is the use of a
miss counter.

(In a non-fully associative cache, less significant bits of the
timestamp can be removed with limited penalty under the assumption
that misses will be somewhat evenly distributed.)

I receive the impression that the storage overhead of even partial
timestamps is considered excessive. (Tag checks for a large fully
associative cache would also be problematic.)

[snip]

> For every group of 8 cache lines in an 8-way associative cache, one
> has three-bit counters with their own circuitry, operating in
> parallel...

8-way associative caches usually use binary-tree pseudoLRU. Seven
bits per set are used; each bit indicates if the MRU access was in
the left or right half of its group of cache lines.
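
A rough Python sketch of binary-tree pseudo-LRU with seven bits per 8-way set follows. The heap-style node numbering, the bit polarity (1 means the MRU access was in the right half) and the function names are assumptions made for illustration; real implementations may differ.

```python
WAYS = 8  # must be a power of two for this heap-style layout


def plru_touch(bits, way):
    """Record an access: each node on the path notes which half was used."""
    bits = list(bits)
    node, lo, hi = 0, 0, WAYS
    while hi - lo > 1:
        mid = (lo + hi) // 2
        went_right = way >= mid
        bits[node] = 1 if went_right else 0   # 1 = MRU access was in right half
        node = 2 * node + (2 if went_right else 1)
        lo, hi = (mid, hi) if went_right else (lo, mid)
    return bits


def plru_victim(bits):
    """Walk away from the recorded MRU half at every level of the tree."""
    node, lo, hi = 0, 0, WAYS
    while hi - lo > 1:
        mid = (lo + hi) // 2
        go_right = bits[node] == 0            # MRU was left, so evict from right
        node = 2 * node + (2 if go_right else 1)
        lo, hi = (mid, hi) if go_right else (lo, mid)
    return lo


bits = [0] * 7
for w in [0, 1, 2, 3, 4, 5, 6, 7, 0]:
    bits = plru_touch(bits, w)
print(plru_victim(bits))  # 4, whereas true LRU would evict way 1
```

The last line shows why this is only an approximation: touching way 0 flips the root bit, which steers the next eviction into the other half regardless of how recently those ways were used.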

True LRU only has P(8,8) states, 40,320 states (correct?) can be
held in six bits. Some logic could then translate the six bit value
to a replacement choice in the case of a miss and a new six bit
value or a new six bit value based on the additional three bits
representing the cache line accessed on a hit. (I would not be
surprised if one could simplify the logic by a modest increase in
storage.)

(Using the method proposed by the original poster would require 24
bits of storage per set, albeit such would presumably have a much
simpler logic for handling transitions.)

Since LRU is _only_ a heuristic, exactness is usually not considered
worth great effort.

The NRU method used by one of the Itanium L3 implementations used
one bit per cache line. It set the bit on access and if all bits
were set, then all bits were cleared. The victim was selected by
finding the first unset bit. (This might be interesting when
applied to a skewed associative cache in conjunction with tree-based
pLRU in a 4-way bundle. E.g., eight cache line bundles could be
grouped together with eight bits indicating recently-used. It might
even be appropriate to use bits of the address to select at which
bit [and perhaps from which direction] to begin the find-first-unset
bit, allowing a more randomized replacement than always searching
from the left end to the right. Presumably an access that resets
the bitfield would set the bit corresponding to the accessed cache
line bundle. [I receive the impression that the Itanium cache only
cleared the bitfield, so there was presumably some bias in
replacement?] Even in a conventional indexing cache, using an NRU
bitfield to select a group managed by LRU [for two entries] or pLRU
might be worth considering.)
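
The NRU scheme as described above (set a bit on access, clear everything once all bits are set, evict the first unset bit) can be sketched in a few lines of Python. This follows the description in this post only; the function names are made up and nothing here is taken from Itanium documentation.

```python
def nru_touch(used, way):
    """Mark `way` as recently used; clear the whole field once it fills up."""
    used = list(used)
    used[way] = True
    if all(used):
        # As described above, every bit is cleared, including the one
        # just set, so the line just accessed is not protected.
        used = [False] * len(used)
    return used


def nru_victim(used):
    """Select the first way (searching from the left) whose bit is unset."""
    return used.index(False)


used = [False] * 4
for w in (0, 1, 2):
    used = nru_touch(used, w)
print(nru_victim(used))   # 3
used = nru_touch(used, 3)
print(nru_victim(used))   # 0
```

The second result illustrates the left-to-right bias mentioned above: right after a reset, way 0 is always the first candidate.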

Anton Ertl

Jul 20, 2011, 7:58:25 AM

"Paul A. Clayton" <paaron...@gmail.com> writes:
>8-way associative caches usually use binary-tree pseudoLRU. Seven
>bits per set are used; each bit indicates if the MRU access was in
>the left or right half of its group of cache lines.

Yes. Unfortunately, this approximation is much harder to analyse for
worst-case execution time than true LRU. So if you design a CPU for
hard-real-time systems, true LRU is a definite plus. IIRC as far as
the analysis is concerned, such an 8-way pseudoLRU cache gives the
same worst-case result as a 2-way LRU cache with a quarter of the size
(i.e., 6 of your 8 ways don't help in this context).

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Quadibloc

Jul 20, 2011, 10:41:48 AM

On Jul 19, 11:19 pm, "Paul A. Clayton" <paaronclay...@gmail.com>
wrote:

> 8-way associative caches usually use binary-tree pseudoLRU. Seven
> bits per set are used; each bit indicates if the MRU access was in
> the left or right half of its group of cache lines.

Ah. This is good to know.

> True LRU only has P(8,8) states, 40,320 states (correct?) can be
> held in six bits. Some logic could then translate the six bit value
> to a replacement choice in the case of a miss and a new six bit
> value or a new six bit value based on the additional three bits
> representing the cache line accessed on a hit. (I would not be
> surprised if one could simplify the logic by a modest increase in
> storage.)

Well, I was just happy that I worked out a way to do LRU that seemed
reasonably fast, keeping things down to a modest number of gate delays
(perhaps still too many) which was simple enough for me to understand.

I was not prepared to think of going *there*!

John Savard

Joe Pfeiffer

Jul 20, 2011, 11:50:17 AM

"Paul A. Clayton" <paaron...@gmail.com> writes:
>

> True LRU only has P(8,8) states, 40,320 states (correct?) can be
> held in six bits. Some logic could then translate the six bit value

I must be misreading you here -- six bits can only hold 64 states. What
did you mean to say?

Paul A. Clayton

Jul 20, 2011, 1:06:41 PM

On Jul 20, 11:50 am, Joe Pfeiffer <pfeif...@cs.nmsu.edu> wrote:

Aargh! It was just a brain malfunction! Obviously sixTEEN bits are
needed. (This is still a little less than the 24 bits of the
proposed method, but perhaps not enough to justify the more complex
state update logic.)
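
The bit counts are easy to verify with integer arithmetic. This is just a check of the figures in the thread; the variable names are illustrative.

```python
import math

ways = 8
orderings = math.factorial(ways)                 # 8! distinct recency orders
min_bits = (orderings - 1).bit_length()          # smallest field that can name them all
counter_bits = ways * (ways - 1).bit_length()    # one 3-bit counter per way

print(orderings, min_bits, counter_bits)         # 40320 16 24
```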

Thank you for catching the error!

EricP

Jul 20, 2011, 1:16:06 PM

Paul A. Clayton wrote:
>
> True LRU only has P(8,8) states, 40,320 states (correct?) can be
> held in six bits. Some logic could then translate the six bit value
> to a replacement choice in the case of a miss and a new six bit
> value or a new six bit value based on the additional three bits
> representing the cache line accessed on a hit. (I would not be
> surprised if one could simplify the logic by a modest increase in
> storage.)

16 bits, and the victim id is not stored in a convenient manner.
To do it this way you'd need a 40320 rows*19 bit ROM,
the output being a 3 bit victim id and a 16 bit next state.
Maybe with clever encoding you could arrange it so the
lower 3 bits of the next state were the victim id so
you could get it down to "only" 40320*16 = 645120 bits,
plus a 16 => 64k decoder.

For 4 ways it wouldn't be bad though. 4! = 24 rows of 5 bits.

Eric

Terje Mathisen

Jul 20, 2011, 1:55:30 PM

That since 8! is 40320 you would need at least 16 bits to hold all that
info?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

EricP

Jul 20, 2011, 2:12:59 PM

Quadibloc wrote:
>
> Well, in the reference I was reading, I didn't see an implementation
> technique, but a little thought allowed me to realize one way this
> could be done.

The mechanism I came up with for True LRU used shift registers.
For 8 ways it needs 3 bits for the way number, 1 valid bit,
by 8 entries (numbered 1 to 8).
Each entry has 3 XORs for matching ids and some glue logic.

New Way Ids enter on the left, and shift to the right.
So the MRU id is in position 1, and LRU in position 8.
E.G.

Pos 1 2 3 4 5 6 7 8
WayId => 3 6 2 4 1 7 0 5

When a Way is touched we shift its id in on the left,
and selectively shift all entries towards the right
*that are younger (left of) the matching value*.

So if Way 1 is touched we match it in Pos 5, shift right all
entries younger than Pos 5 and shift Way 1 in on the left.
Pos 6, 7, 8 do not shift.

Pos 1 2 3 4 5 6 7 8
WayId => 1 3 6 2 4 7 0 5

If we need a victim, it is in Pos 8 (Way 5 in this example).

This requires (3+1)*8 master-slave latch + 3*8 XOR plus
8 3-input AND plus sundry glue, or about 32*16+24*4+8*6 =
about 656 transistors for each cache line set,
works in a single clock, and scales at (log_2(N)+1)*N.
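
A behavioural Python model of the shift-register ordering, for checking the example above. It models only the resulting MRU-to-LRU order, not the latches, XOR comparators or valid bits, and the function names are invented for this sketch.

```python
def shift_touch(order, way):
    """Move `way` to position 1 (MRU).

    Entries that were younger than the match (to its left) shift right
    by one; entries older than the match stay where they are.
    """
    pos = order.index(way)
    return [way] + order[:pos] + order[pos + 1:]


def shift_victim(order):
    """The LRU way sits in the last position."""
    return order[-1]


order = [3, 6, 2, 4, 1, 7, 0, 5]
order = shift_touch(order, 1)
print(order)                # [1, 3, 6, 2, 4, 7, 0, 5]
print(shift_victim(order))  # 5
```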

Eric

Quadibloc

Jul 20, 2011, 2:17:01 PM

On Jul 20, 11:55 am, Terje Mathisen <"terje.mathisen at tmsw.no">
wrote:
> Joe Pfeiffer wrote:

> > I must be misreading you here -- six bits can only hold 64 states. What
> > did you mean to say?
>
> That since 8! is 40320 you would need at least 16 bits to hold all that
> info?

No, that's what _he_ meant to say. 16 bits instead of 6 bits is the
answer to his question.

John Savard

Joe Pfeiffer

Jul 20, 2011, 2:38:02 PM

"Paul A. Clayton" <paaron...@gmail.com> writes:

Ah, this makes a lot more sense!

EricP

Jul 20, 2011, 2:57:27 PM

EricP wrote:
>
> 16 bits, and the victim id is not stored in a convenient manner.
> To do it this way you'd need a 40320 rows*19 bit ROM,
> the output being a 3 bit victim id and a 16 bit next state.
> Maybe with clever encoding you could arrange it so the
> lower 3 bits of the next state were the victim id so
> you could get it down to "only" 40320*16 = 645120 bits,
> plus a 16 => 64k decoder.

Hmmm... this is wrong.
We have a current state code and when we hit on a Way number
we use the current state plus the hit way to map to the new state.

So it needs 16 bits for the 40320 current state codes,
plus 3 bits for the Way number that hits,
and that maps to a new 16 bits state code
(which we arrange so that the victim id is in the lower 3 bits).
So it needs a 40320*8*16= 5,160,960 bit ROM,
plus a 19 => 524,288 decoder.

> For 4 ways it wouldn't be bad though. 4! = 24 rows of 5 bits.

For 4 Ways it needs 24*4 rows of 5 bits = 480 bits.
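
The corrected table sizes can be reproduced with a short Python calculation, assuming one ROM row per (current state, hit way) pair and a minimal-width state code. `lookup_table_bits` is a name made up for this check.

```python
import math


def lookup_table_bits(ways):
    """Return (rows, state_bits, total_rom_bits, address_bits) for a full table."""
    states = math.factorial(ways)             # distinct recency orders
    state_bits = (states - 1).bit_length()    # bits to name one state
    way_bits = (ways - 1).bit_length()        # bits to name the hit way
    rows = states * ways                      # one row per (state, hit way) pair
    return rows, state_bits, rows * state_bits, state_bits + way_bits


print(lookup_table_bits(8))  # (322560, 16, 5160960, 19)
print(lookup_table_bits(4))  # (96, 5, 480, 7)
```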

Eric

Andy "Krazy" Glew

Jul 21, 2011, 11:03:48 PM

There are many schemes that I call pseudo-LRU, not just tree-LRU.

E.g. the hardware equivalent of the OS clock algorithm.

Paul A. Clayton

Jul 24, 2011, 12:34:33 AM

On Jul 21, 11:03 pm, "Andy \"Krazy\" Glew" <a...@SPAM.comp-arch.net>
wrote:
[snip]

> There are many schemes that I call pseudo-LRU, not just tree-LRU.
>
> E.g. the hardware equivalent of the OS clock algorithm.

I thought the clock algorithm (if I understand correctly, this
is what one of the Itanium's L3 caches used) was called NRU
(Not Recently Used). At least I _think_ the paper mentioning
the Itanium implementation used 'NRU'.
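
For comparison, here is a small Python sketch of the textbook clock (second-chance) policy applied to the ways of one set. It is only an assumption about what a hardware equivalent might look like: one reference bit per way plus a hand pointer. It differs from the NRU description earlier in the thread in that the search starts at the hand rather than always at the left end.

```python
def clock_touch(ref, way):
    """Set the reference bit of the accessed way."""
    ref = list(ref)
    ref[way] = 1
    return ref


def clock_victim(ref, hand):
    """Return (victim, new_ref, new_hand).

    Starting at `hand`, ways with their bit set get a second chance:
    the bit is cleared and the hand moves on. The first way found with
    a clear bit is the victim. At most one full sweep is needed.
    """
    ref = list(ref)
    n = len(ref)
    while ref[hand]:
        ref[hand] = 0
        hand = (hand + 1) % n
    return hand, ref, (hand + 1) % n


print(clock_victim([1, 1, 0, 1], 0))  # (2, [0, 0, 0, 1], 3)
```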

Paul A. Clayton

Aug 2, 2011, 1:26:32 PM

On Jul 20, 1:16 pm, EricP <ThatWouldBeTell...@thevillage.com> wrote:
[snip]

> 16 bits, and the victim id is not stored in a convenient manner.
> To do it this way you'd need a 40320 rows*19 bit ROM,
> the output being a 3 bit victim id and a 16 bit next state.
> Maybe with clever encoding you could arrange it so the
> lower 3 bits of the next state were the victim id so
> you could get it down to "only" 40320*16 = 645120 bits,
> plus a 16 => 64k decoder.
>
> For 4 ways it wouldn't be bad though. 4! = 24 rows of 5 bits.

It seems obvious that using logic to 'calculate' the victim
and new state would have less overhead.

For a 4-way LRU, using two bits to identify the MRU way
number, two bits to identify the second MRU way number and
one bit to identify which of the remaining ways was next
most recently used. Using direct encodings of the two MRU
ways could simplify update (which would probably be
preferred over simplifying victim selection since hits are
more likely than misses and victim selection is probably
less latency sensitive), a second MRU hit could simply
swap the MRU entries (with an MRU hit obviously leaving
the state unchanged), so more than 50% of the time
(assuming an access pattern friendly to LRU replacement)
the state update would be 'trivial'.

Along similar lines, for an eight-way LRU one could divide
the 16 bits thus: 3 bits MRU, 3 bits second MRU, 5 bits
for the third (six states) and fourth (five states) entries,
2 bits for the fifth (four states) entry, 2 bits for the
sixth (three states) entry, and one bit for the seventh
entry. This would allow the data for all but two of the
entries to be extracted by bit extractions and generating
the quotient and remainder of a five bit number divided by
six is not especially difficult (one bit of the remainder
is a bit extraction, so it could be available early).
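
One possible reading of this 16-bit layout, sketched in Python. The field order within the word, the use of ranks among the not-yet-placed ways for the third through seventh entries, and the names `pack` and `unpack` are all assumptions for illustration; the post does not pin these details down.

```python
def pack(order):
    """Pack an MRU-to-LRU list of the 8 way numbers into a 16-bit integer."""
    assert sorted(order) == list(range(8))
    remaining = [w for w in range(8) if w not in order[:2]]
    ranks = []
    for way in order[2:7]:
        ranks.append(remaining.index(way))   # rank among ways not yet placed
        remaining.remove(way)
    r3, r4, r5, r6, r7 = ranks
    word = order[0]                          # 3 bits: MRU way number
    word = (word << 3) | order[1]            # 3 bits: second MRU way number
    word = (word << 5) | (r4 * 6 + r3)       # 5 bits: third (6 choices), fourth (5)
    word = (word << 2) | r5                  # 2 bits: fifth (4 choices)
    word = (word << 2) | r6                  # 2 bits: sixth (3 choices)
    word = (word << 1) | r7                  # 1 bit: seventh (2 choices)
    return word


def unpack(word):
    """Recover the MRU-to-LRU list from a word produced by pack()."""
    r7 = word & 1
    r6 = (word >> 1) & 3
    r5 = (word >> 3) & 3
    r4, r3 = divmod((word >> 5) & 31, 6)     # quotient and remainder by six
    order = [(word >> 13) & 7, (word >> 10) & 7]
    remaining = [w for w in range(8) if w not in order]
    for rank in (r3, r4, r5, r6, r7, 0):     # the eighth entry is whatever is left
        order.append(remaining.pop(rank))
    return order


example = [3, 6, 2, 4, 1, 7, 0, 5]
assert unpack(pack(example)) == example
```

Because six is even, the low bit of the remainder `r3` is simply the low bit of the 5-bit field, which matches the remark about one remainder bit being a plain bit extraction.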

Obviously reencoding from invalidations would be more
involved (assuming invalidated blocks become LRU). (One
alternative would be to select a victim with a simple
left-to-right search for an invalid block and for update
treat the victim as a hit when an invalid block is the
victim.)

EricP

Aug 3, 2011, 2:00:33 PM

Ok, having the MRU directly accessible could be of value for,
say, Way prediction (for lower power),
and granted when it misses you don't need the LRU Way# immediately.

Your approach does eliminate 1/2 the possible states,
but the state space is still the same size.

When there is a hit, you compare the hit Way # to MRU and 2RU.
If it is not in either of those positions then
you would still have 24 current states (5 bits) + 2 bits for hit#
or 96 potential next states.
Because you tested MRU and 2RU and know that the Hit# does
not match either of them, that eliminates 48 states.
However that leaves 48 states sparsely sprinkled about
a 128 entry lookup table.

So you need a 7 bit => 48 decoder to generate the next state number.

And you need a separate lookup table to map the
5 bit, 24 current state to an LRU Way#.

> Along similar lines, for an eight-way LRU one could divide
> the 16 bits thus: 3 bits MRU, 3 bits second MRU, 5 bits
> for the third (six states) and fourth (five states) entries,
> 2 bits for the fifth (four states) entry, 2 bits for the
> sixth (three states) entry, and one bit for the seventh
> entry. This would allow the data for all but two of the
> entries to be extracted by bit extractions and generating
> the quotient and remainder of a five bit number divided by
> six is not especially difficult (one bit of the remainder
> is a bit extraction, so it could be available early).

Ok but you still have a state space of 16 bits + 3 bits for the Hit#,
divided by 2 because you explicitly picked off MRU and 2RU,
containing 40320 * 8 / 2 = 161280 next states.

So you still need a 19 => 161280 decoder to look up the next state
if it is not in the MRU or 2RU positions.

> Obviously reencoding from invalidations would be more
> involved (assuming invalidated blocks become LRU). (One
> alternative would be to select a victim with a simple
> left-to-right search for an invalid block and for update
> treat the victim as a hit when an invalid block is the
> victim.)

Yes, all this ignores invalidations, such as for coherency.
Invalidates double the state transitions because for them
you want to move the invalid Way# into the LRU position.

For example, if our MRU..LRU order is 0 1 2 3 and 2 gets invalidated,
we move it to the LRU position 0 1 3 2.
Later if 0 gets invalidated we get 1 3 2 0.
You shift the Way# in from the left for hits,
shift in from the right for invalidates.
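
The two update directions can be written as a tiny Python model of the MRU-to-LRU order. The function names `hit` and `invalidate` are illustrative only.

```python
def hit(order, way):
    """A hit shifts the way in from the left (MRU end)."""
    return [way] + [w for w in order if w != way]


def invalidate(order, way):
    """An invalidate shifts the way in from the right (LRU end)."""
    return [w for w in order if w != way] + [way]


order = invalidate([0, 1, 2, 3], 2)
print(order)                 # [0, 1, 3, 2]
print(invalidate(order, 0))  # [1, 3, 2, 0]
```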

Eric

EricP

Aug 3, 2011, 4:30:01 PM

EricP wrote:
>
> Ok but you still have a state space of 16 bits + 3 bits for the Hit#,
> divided by 2 because you explicitly picked off MRU and 2RU,
> containing 40320 * 8 / 2 = 161280 next states.
>
> So you still need a 19 => 161280 decoder to look up the next state
> if it is not in the MRU or 2RU positions.

Oops, minor boo boo.
In the case of 4 way, matching MRU and 2RU eliminates 2 of 4
possible state transitions.
But for 8 way, it only eliminates 2 of 8 transitions, so that
should be 40320 * 8 * 3/4 = 241920 entries in a next state table.

Plus a separate 40320 entry table for looking up LRU Way#.

Eric

dmackay

Aug 3, 2011, 8:00:27 PM

On Jul 19, 10:38 pm, Quadibloc <jsav...@ecn.ab.ca> wrote:
> Another thing is the cache. Suppose that, for the level-2 cache at
> least, I want to get really fancy. How is it done?
>
> With 2-way associativity, it's simple enough - you use one cache line,
> and you mark it as "most recent" and the other as "least recent".
>
> But with a fully-associative cache, what is one supposed to do? Fully
> timestamp every memory access? Setting a counter on the cache line
> used to zero, and incrementing all the others, could well lead to
> counters overflowing on cache lines that weren't used for a while -
> since the need to discard something to make room for something new
> might only come along infrequently.
>
> Well, in the reference I was reading, I didn't see an implementation
> technique, but a little thought allowed me to realize one way this
> could be done.
>
> For every group of 8 cache lines in an 8-way associative cache, one
> has three-bit counters with their own circuitry, operating in
> parallel...
>
> and what happens when memory is referenced is this:
>
> The counter of the cache line accessed is set to zero - and its *old
> value* is broadcast to the counters of the other cache lines. Those
> counters increment only if their value is *less* than that value. (We
> will see below why it should actually be "less than or equal to".)
>
> That allows a true LRU policy to be maintained without fancy
> timestamps or enormous counters.

Two comments:
1) True LRU has already been done on associativities greater than 8 - in
real (aka, not a toy) products
2) There's an easier way to do it

zxwi...@gmail.com

Apr 23, 2013, 4:16:50 AM

Which product has implemented true LRU on a cache with more than 8-way associativity?

Paul A. Clayton

Apr 23, 2013, 10:07:42 AM

On Apr 23, 4:16 am, zxwin...@gmail.com wrote:
[snip]

> Which product has implemented true LRU on a cache with more than 8-way associativity?

While not the same as a cache, the Itanium 2 TLBs use
LRU replacement. The L1 TLBs (32 entries) are documented
as using "true LRU". It is somewhat less clear if the
L2 TLBs (128 entries with up to 64 locked) are also
truly LRU, though I suspect such is the case.

If 32-way (or possibly even 128-way) true LRU can be
handled in a TLB (where there is only one set so the
metadata storage and update logic can be integrated),
handling 8-way for a cache (with multiple sets) might
not be so horrible. However, I suspect the primary
advantage of true LRU in an 8-way associative cache
would be in WCET analysis (as Anton Ertl already mentioned:
Message-ID: <2011Jul2...@mips.complang.tuwien.ac.at>),
so I _suspect_ that the cache would be in an embedded
processor targeted for real-time use.

Anne & Lynn Wheeler

Apr 23, 2013, 11:07:50 AM

"Paul A. Clayton" <paaron...@gmail.com> writes:

> While not the same as a cache, the Itanium 2 TLBs use
> LRU replacement. The L1 TLBs (32 entries) are documented
> as using "true LRU". It is somewhat less clear if the
> L2 TLBs (128 entries with up to 64 locked) are also
> truly LRU, though I suspect such is the case.

360/67 had "true" LRU 8-entry associative array (TLB)

for other drift ... while undergraduate in the 60s ... i changed cp67
(on 360/67) to have (clock-like) "global" (approximate) LRU for page replacement
... this was about the same time there was ACM article about "local" LRU
for replacement.

at dec81 ACM SIGOPS, Jim Gray asked me if I could help a co-worker with
his Stanford PHD ... on global LRU and clock page replacement. The
awarding of his PHD was strongly being opposed by the "local" LRU
forces. It turns out that in the early 70s, the Grenoble Scientific
Center had modified cp67 to implement the "local" LRU description ... so
we had cp67 systems running on similar hardware with similar workloads
for comparison of "local" and "global". The Cambridge Scientific Center
360/67 was 768kbyte (104 pageable pages after fixed storage
requirements) and would support 70-80 users with subsecond response
(with global). The Grenoble Scientific Center 360/67 was 1mbyte (156
pageable pages after fixed storage requirements) would support 35 users
with similar workload, response and throughput (with "local", half the
users with 50% more real storage).

It also turns out that I had done some other work on global clock
variations in the early 70s at the Cambridge Scientific Center ... and
in paging simulator (from full instruction traces) ... there was a clock
variation that could beat "TRUE LRU". The issue was that "TRUE LRU"
would get into pathological situations and degenerate to FIFO. The
global clock variation had a peculiar sleight-of-hand that would
degenerate to RANDOM instead (in the situations where TRUE LRU
degenerated to FIFO).

I wrote a response ... but it took nearly a year to get permission
to send it ... some of it here
http://www.garlic.com/~lynn/2006w.html#email821019
in this past post
http://www.garlic.com/~lynn/2006w.htmL#46

I hoped that rather than research management taking sides in the
academic dispute over (local versus global) paging replacement
algorithms ... their blocking my sending a reply was punishment for some
perceived transgressions (about this time I was being blamed for
computer conferencing on the internal network).

--
virtualization experience starting Jan1968, online at home since Mar1970

zxwi...@gmail.com

May 13, 2013, 9:58:27 AM

Thanks very much, Paul, Anne & Lynn. It's really a great help.