caching paradox: the most used item is evicted first

v...@t.com

May 25, 2012, 6:16:04 AM

Normally, there are several levels of caching in every memory
architecture, with L1 backed by L2. Suppose one item is the
most used one. It is therefore kept in L1 and, therefore, very rarely
requested from L2. That makes it a first candidate for eviction from L2!
The most used item is evicted first!

Is this how it actually works?
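
A minimal sketch of the scenario in the question, assuming a strictly inclusive two-level hierarchy where each level is fully associative with true LRU, and where the L2 only sees L1 misses. The class `LRU`, the function `run`, and the sizes are hypothetical, chosen only to make the effect visible.

```python
from collections import OrderedDict

class LRU:
    """Tiny fully associative LRU cache of block addresses (illustrative)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()     # least recently used first

    def touch(self, addr):
        """Access addr; return (hit, evicted_addr_or_None)."""
        if addr in self.blocks:
            self.blocks.move_to_end(addr)
            return True, None
        victim = None
        if len(self.blocks) >= self.capacity:
            victim, _ = self.blocks.popitem(last=False)
        self.blocks[addr] = True
        return False, victim

def run(trace, l1_size=2, l2_size=4):
    """Return the blocks removed from L1 because the L2 evicted them."""
    l1, l2 = LRU(l1_size), LRU(l2_size)
    back_invalidated = []
    for addr in trace:
        hit, _ = l1.touch(addr)
        if hit:
            continue                    # the L2 never sees L1 hits
        _, l2_victim = l2.touch(addr)
        if l2_victim is not None and l2_victim in l1.blocks:
            del l1.blocks[l2_victim]    # inclusion forces a back-invalidate
            back_invalidated.append(l2_victim)
    return back_invalidated
```

With the trace `["H", 1, "H", 2, "H", 3, "H", 4]`, the hot block `"H"` hits in L1 every time, so its L2 copy ages until the fourth streaming block pushes it out and `run` returns `["H"]`.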

Terje Mathisen

May 25, 2012, 7:25:33 AM

Can be.

What if the L2 cache is exclusive, i.e. it never caches anything that is
currently in L1, but saves it automatically back in L2 when evicted from L1?
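
A rough sketch of that exclusive arrangement, assuming both levels are fully associative LRU and that a block lives in at most one level at a time. The function name `access` and the sizes are hypothetical.

```python
from collections import OrderedDict

def access(l1, l2, addr, l1_size=2, l2_size=4):
    """Access addr in an exclusive L1/L2 pair; return where it was found.

    l1 and l2 are OrderedDicts ordered least recently used first.
    """
    if addr in l1:
        l1.move_to_end(addr)
        return "L1"
    # On an L2 hit the block moves up, so it leaves the L2.
    source = "L2" if l2.pop(addr, None) else "memory"
    if len(l1) >= l1_size:
        victim, _ = l1.popitem(last=False)
        if len(l2) >= l2_size:
            l2.popitem(last=False)      # oldest L1 victim falls out of L2
        l2[victim] = True               # the L1 victim is saved in L2
    l1[addr] = True
    return source
```

Because the L2 only ever holds L1 victims, a block that stays hot in L1 has no L2 copy to lose.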

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Paul A. Clayton

May 25, 2012, 9:39:26 AM

For forced exclusion (i.e., contents of L1 will not be in L2),
this is not a problem (as Terje noted). One might also note
that an L1 victim block might preferentially be evicted to
memory or L3 (if expected reuse distance is sufficiently great
that the block would likely be evicted from L2 anyway or if
L3 is shared and another core that uses the block does not
share or have a fast connection to that L2).

For forced inclusion, this could result in back invalidations
(removal from L1 caches) which is generally undesirable.
Undesirable back invalidations can be avoided by leaking LRU
information from L1 to L2 or by locking cache blocks in L2
that are in an L1 cache. Locking blocks can make LRU-based
replacement more complex (without locking, the victim is
always the least recently used--or perhaps the 'left-most'
invalid block; with locking more of the recency order must
be considered); leaking LRU information can require more
communication. With simple modulo indexing using the
physical address and adequate L2 associativity and a
single core with LRU replacement using L2, locking does not
violate global LRU. (With multiple cores sharing an L2,
there is the potential for a difference between global LRU
and local LRU. [This would also seem to apply for
instruction and data caches to a lesser degree.])

For an intermediate inclusion policy, an invalidation of an
L2 cache block could be handled as with an exclusive system
by communicating to the L1 that the block is no longer in
L2 or something more like an inclusive policy could be used
or some combination.

Not entirely unrelated is the suboptimal use of LRU-based
replacement when including frequency of use information in
replacement decisions would be better (when reuse distance
is not well predicted by recency of last use) and when
random or MRU replacement would be better (e.g., when all
blocks have equal reuse distance and there is insufficient
capacity as in some streaming access patterns--keeping
some of the blocks in cache to be reused is better than
being fair).
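
The streaming case can be illustrated with a small experiment, assuming a fully associative cache and a cyclic trace that is one block larger than the cache. The function `hits` is a hypothetical helper, not taken from any real simulator.

```python
def hits(trace, capacity, policy):
    """Count hits in a fully associative cache; policy is 'lru' or 'mru'."""
    order = []                      # least recently used first
    count = 0
    for addr in trace:
        if addr in order:
            count += 1
            order.remove(addr)
        elif len(order) >= capacity:
            # LRU evicts the oldest entry, MRU evicts the newest one.
            order.pop(0 if policy == "lru" else -1)
        order.append(addr)
    return count

trace = list(range(5)) * 10         # 5 blocks cycled through a 4-block cache
lru_hits = hits(trace, 4, "lru")
mru_hits = hits(trace, 4, "mru")
```

Under these assumptions LRU evicts each block just before it is needed again and never hits, while MRU keeps part of the working set resident and does hit.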

EricP

May 25, 2012, 1:20:42 PM

As I understand it, current Intel caches use a "clock"
pseudo-LRU mechanism to track both L1 and L2 entries.
(I believe IBM uses the "binary tree" pseudo-LRU mechanism.)
In clock, each Way in the Set has a Used bit. As each Way entry is accessed,
its Used bit is set. If a Way is accessed and all Used bits are already
set, then all Used bits in the Set are cleared and the Used bit for that Way is set.
When we need a victim Way entry, we select one with a clear Used bit.
The cost is just 1 flip-flop per Way, plus some glue logic.
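
A minimal sketch of the Used-bit scheme described above, for a single set. Two details are assumptions of this sketch and not claims about any real design: the bits are reset as soon as the last clear bit would be set (so a victim always exists), and the victim is the lowest-numbered way with a clear bit. The class name `UsedBitSet` is hypothetical.

```python
class UsedBitSet:
    """One cache set with one Used bit per way."""
    def __init__(self, ways):
        self.used = [False] * ways

    def access(self, way):
        self.used[way] = True
        if all(self.used):
            # Every way is now marked: clear them all, keep only this one.
            self.used = [False] * len(self.used)
            self.used[way] = True

    def victim(self):
        # Assumption: choose the lowest-numbered way whose bit is clear.
        return self.used.index(False)
```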

This works fine for L1 as it can mark the "clock" for each access.
However an inclusive L2 cache does not see the L1 access and has no
idea what was recently used. It tracks LRU based on when L2 was loaded.
So when L2 selects a victim, it selects the least recently loaded into L2,
which is not necessarily the best choice.

As I suggested in this group previously, for fully inclusive or
partial (NI/NE) caches, L2 should track Least Recently Evicted from L1.
That is, when L1 evicts an entry it informs L2, and L2 marks that
entry Used in its clock. When an entry is copied from L2 to L1,
it is also marked as Used. The net result is a variation on the
second chance pseudo-LRU page replacement algorithm.

For an exclusive cache, entries added to L2 would, by definition, be L1
evictions (ignoring L2 prefetches) and should be marked when added to L2.

Prefetches to L2 but not L1 kinda buggers things up.
You don't want prefetch values to evict anything useful,
but the pseudo-clock doesn't provide enough tracking precision
to determine how old each entry really is.
I'd kinda like a prefetch value to start out half-aged.

Eric

MitchAlsup

May 25, 2012, 5:34:13 PM

At the naive academic level, yes. Most of the time designers of real
chips do things a bit differently to avoid these kinds of problems.

Also consider that a code loop that is running has such great hit
rates in the instruction cache that it may not be replaced for a long time;
but when it is, its L2 and L3 cache images have disappeared.

In more than one project I was working on, we would send messages from
the L1 to L2 to attempt to avoid these losses.

Mitch

Andy (Super) Glew

May 26, 2012, 7:36:06 PM

On 5/25/2012 10:20 AM, EricP wrote:
> Paul A. Clayton wrote:
>> On May 25, 6:16 am, v...@t.com wrote:
>>> Normally, there are several levels of caching in every memory
>>> architecture. Thereby, L1 is backed by L2. Suppose one item is the
>>> most used one. It is therefore kept in L1 and, therefore, very rarely
>>> requested from L2. It is a first candidate for eviction out of L2! The
>>> most used item is evicted first!
>>>
>>> Is it how the stuff works?

Others have responded accurately. I'll pile on, and explain in terms I
am familiar with:

I won't talk about exclusive caches.

I'll also skip past LRU versus pseudo-LRU versus clock replacement, as
EricP described, handwaving that clock is supposed to be an
approximation to LRU.

For an inclusive cache, a naive L1/L2 system with LRU might work as you
describe. However, this performs badly. Some systems "trickle through"
information to update the L2 LRU, providing messages from L1 to L2 that
help keep the L2 more accurate.

Some systems may backwards query on an eviction from the L2. They may
have to do a backwards invalidate anyway, if evicting data that may be
in the L1. The backwards query gives the L1 the chance to say "Hey,
don't replace that line - I'm using it."

The backwards query doesn't need to be done at replacement time. It can
be done in advance.
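
A small sketch of the backwards query, assuming the L2 walks its candidates from oldest to newest and the L1 can veto each one. The function `choose_l2_victim` and the callable `l1_in_use` are hypothetical stand-ins for the hardware handshake.

```python
def choose_l2_victim(l2_lru_order, l1_in_use):
    """Return the oldest L2 line that the L1 does not veto.

    l2_lru_order: line addresses, least recently used first.
    l1_in_use: callable standing in for the backwards query to the L1.
    """
    for line in l2_lru_order:
        if not l1_in_use(line):
            return line
    # Every candidate is live in the L1: fall back to plain LRU,
    # which then needs a backwards invalidate.
    return l2_lru_order[0]
```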

Often you want the L2 cache to know which lines are in the L1, or in
which L1 if there are multiple L1s per L2. To do this accurately the L1
needs to inform the L2 when a line is no longer in use. This can
sometimes be done with just a few extra bits piggybacking on other
transactions. EricP already mentioned "replace the least recently
replaced from L1" policy.

Some older systems had the CPU or some inner controller know the shape,
sets and ways, of the outer cache. This way, when a line missed in the
L1, it knew which was the appropriate line in L2 to evict. This comes
very close to having the cache controller maintain LRU for both the L1
and L2. But it requires a cache controller that manages both L1 and L2.
Works less well with multiple L1s per L2. And requires that the L1
cache controller know the shape of the L2, which can be hard for
organizations.

Finally, when you have NINE (Non Inclusive Non Exclusive) caches, or
accidentally inclusive, it doesn't matter so much if you evict a line
from the L2 that is currently in the L1. Such caches allow this - they
may allow the line in the L1 to continue to be used, even though not in
the L2. If the line is eventually evicted from the L1, it can be
(re)allocated in the L2 if necessary. Often, it is not necessary.

MitchAlsup

May 27, 2012, 3:11:04 PM

to an...@spam.comp-arch.net

On Saturday, May 26, 2012 6:36:06 PM UTC-5, Andy (Super) Glew wrote:
> Often you want the L2 cache to know which lines are in the L1, or in
> which L1 if there are multiple L1s per L2. To do this accurately the L1
> needs to inform the L2 when a line is no longer in use.

Piling on being so much fun..........

I have done several L1/L2 systems that had the L2 cache filter snoop
requests from the L1s. MOESI only uses 5 of the 8 possible states,
so I use 2 of these states to represent the cases where the L2 knew
that an L1 contained the line and snoops would be directed forward.
One state recorded that the L1 had not obtained write permission, while
another indicated that it had. So, in one case the snoop forwarding
was converted into an invalidate, in the other the snoop was replayed
at the L1.

These were inclusive cache hierarchies.
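
A sketch of how such an encoding might look, assuming a 3-bit state field. The names `L1_READ` and `L1_WRITE` are hypothetical labels for the two extra states, and the sketch ignores the type of the incoming snoop.

```python
from enum import Enum

class L2State(Enum):
    # The five MOESI states.
    M = 0
    O = 1
    E = 2
    S = 3
    I = 4
    # Two of the three spare encodings (names are hypothetical).
    L1_READ = 5     # an L1 holds the line without write permission
    L1_WRITE = 6    # an L1 holds the line with write permission

def forward_to_l1(state):
    """What the L2 sends to the L1 for an incoming snoop; None = filtered."""
    if state is L2State.L1_READ:
        return "invalidate"     # the L1 copy cannot be dirty
    if state is L2State.L1_WRITE:
        return "snoop"          # the L1 copy may be dirty, so ask it
    return None                 # the L2 answers by itself
```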

Mitch

Andy (Super) Glew

May 28, 2012, 1:51:18 PM

I meant to say something about snoop or probe filtering.

Non-inclusive non-exclusive caches are most practical when you have only
one or a few cores. E.g. the P6 family L2 cache is NINE wrt the L1
cache, since in most P6es there is one L1 D$ and one L1 I$ per L2.
Snooping them is not so expensive. (Plus, it is natural to have an L2 at
the place where the D$ and I$ traffic converge to go outside the core.)

When you have many cores using the same snoop filter, NINE makes less
sense. Each snoop may go to 4, 8, or 16 places, so you can save a lot
of power by eliminating the snoops with an inclusive filter that is
guaranteed to be correct when it decides not to send a snoop to the
inner caches.

E.g. the recent Intel LLCs (last-level caches, the L3 $s), which are inclusive of
the CPU cores. The CPU cores have an L2 that is NINE wrt the L1 D$.

However, note that I was trying to carefully say "inclusive snoop
filter", rather than inclusive cache. You can have snoop filters which
are not caches, and vice versa. A snoop filter is a lot cheaper than a
cache - just a few bits per coherency block, rather than 64 bytes. There
are other advantages to data-less snoop filters, which are really
data-less caches for cache line status: e.g. one of my obsessions,
eliminating unnecessary RFO traffic, often 20% of all data traffic, by
implementing "ownership" of lines that are not in the cache.

However, this just pushes off the outer cache LRU issue.

It permits the data caches to be NINE, which has advantages as I
mentioned for LRU tracking.

But you may still have an LRU tracking issue between the data-less snoop
filter and the inner caches. Backwards invalidating or evicting the
inner caches when a snoop filter line is replaced is complicated and
expensive. To avoid such complexity, such snoop filters can be
accidentally inclusive: if an address is not in the snoop filter, you
may have to snoop. Or fetch accurate snoop filter data from a coherency
status tracking store, basically a directory.

nm...@cam.ac.uk

May 28, 2012, 2:06:47 PM

In article <4FC3BB16...@SPAM.comp-arch.net>,

Andy (Super) Glew <an...@SPAM.comp-arch.net> wrote:
>
>When you have many cores using the same snoop filter, NINE makes less
>sense. Each snoop may go to 4, 8, or 16 places, so you can save a lot
>of power by eliminating the snoops with an inclusive filter that is
>guaranteed to be correct when it decides not to send a snoop to the
>inner caches.

The subtleties of cache design are not my area, but the problems I
have seen with that number of cores (mainly on multi-socket systems)
tend to be conflict. If all of your cores are doing similar things,
the amount of 'junk' traffic gets so large that the whole system
slows down, even if there is no actual conflict.

That was the cause of the (old) 8-socket Opteron problems, I believe
it was the cause of the (old) 4-socket Intel problems, and it might have
been on the IBM POWER4 (they have never clarified exactly what it was).

My suspicion is that this is at least as good a reason as power not to
try to solve the scalability problem with a sledgehammer.

Regards,
Nick Maclaren.

Joe keane

May 28, 2012, 3:05:06 PM

In article <wcPvr.30761$6Y6....@newsfe19.iad>,

EricP <ThatWould...@thevillage.com> wrote:
>You don't want prefetch values to evict anything useful,
>but the pseudo-clock doesn't provide enough tracking precision
>to determine how old each entry really is.
>I'd kinda like a prefetch value to start out half-aged.

If there are two bits of state, then you have four levels. At each clock
step you decrease the 'priority'; if it was already priority 0, it gets
kicked out. This is close enough to real LRU that I doubt you would
improve it with more levels.

You can also do what you say, for example, a location that is really
accessed gets set to priority 3, while something that is brought in by
prefetch is set to priority 1. The prefetch should not get kicked out
any time soon, but it's less likely to interfere with data you really
want.

I've done this in software.
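
A software sketch of that two-bit scheme, assuming a clock step ages every entry at once and that a prefetch never demotes an entry that is already present. The class `AgingCache` and its method names are hypothetical.

```python
class AgingCache:
    DEMAND, PREFETCH = 3, 1     # starting priorities from the post

    def __init__(self):
        self.priority = {}      # key -> priority in 0..3

    def access(self, key):
        self.priority[key] = self.DEMAND

    def prefetch(self, key):
        # Assumption: do not lower the priority of an existing entry.
        self.priority.setdefault(key, self.PREFETCH)

    def clock_step(self):
        """Age every entry; return the keys kicked out on this step."""
        evicted = [k for k, p in self.priority.items() if p == 0]
        for k in evicted:
            del self.priority[k]
        for k in self.priority:
            self.priority[k] -= 1
        return evicted
```

In this sketch a prefetched entry that is never touched survives one clock step and is kicked out on the second, while a demand-accessed entry survives three and goes on the fourth.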

v...@t.com

May 29, 2012, 7:01:47 AM

Thanks for the replies. They were interesting, although I was more
interested in the software side (the JVM, if you like, as the computer
architecture). There is a problem of identity mapping in transparent
persistence: the same disk data (DB record) must be loaded into the same
memory object (RAM address). It turns out that transactions come to the
rescue: the L2, here the identity map from DB key to memory object, is not
allowed to evict anything until the transaction completes. I do not know
what the guidelines are for inter-transactional caching.
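
A minimal sketch of such an identity map, written in Python for brevity even though the post is about the JVM. The class `IdentityMap`, the `loader` callback, and the `in_transaction` flag are hypothetical and not taken from any persistence framework.

```python
class IdentityMap:
    """DB key -> in-memory object; eviction is refused inside a transaction."""
    def __init__(self, loader):
        self.loader = loader        # hypothetical function: key -> new object
        self.objects = {}
        self.in_transaction = False

    def get(self, key):
        # The same key always yields the same object while it is mapped.
        if key not in self.objects:
            self.objects[key] = self.loader(key)
        return self.objects[key]

    def evict(self, key):
        if self.in_transaction:
            return False            # pinned until the transaction completes
        return self.objects.pop(key, None) is not None
```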

EricP

May 29, 2012, 1:41:54 PM

What's the difference between an L1 invalidate and an L1 snoop?

I am thinking that an L1 invalidate might not require an ACK
(L2 does a fire-and-forget) whereas an L1 snoop must await a reply.
I'm also assuming there are comms queues connecting L1 and L2,
so that all messages are async.

However an async fire-and-forget would seem to open the possibility
of a race condition if L1 tries to upgrade to write permission at the
same time. So it looks like they both require ACKs.

Eric

nm...@cam.ac.uk

May 29, 2012, 2:25:11 PM

In article <mU7xr.23309$x11....@newsfe21.iad>,

EricP <ThatWould...@thevillage.com> wrote:
>
>I am thinking that an L1 invalidate might not require an ACK
>(L2 does a fire-and-forget) whereas an L1 snoop must await a reply.
>I'm also assuming there are comms queues connecting L1 and L2,
>so that all messages are async.
>
>However an async fire-and-forget would seem to open the possibility
>of a race condition if L1 tries to upgrade to write permission at the
>same time. So it looks like they both require ACKs.

There have been a lot of attempts to produce distributed 'memory'
management (often in the database area) using time-stamped or at
least sequenced fire-and-forget. But, in the absence of a global
clock that is synchronised to better than the 'memory access'
latency, I have never seen a design that actually held water.

And note that the latency in that sense is the serial latency
divided by the number of agents. That can be alleviated by
suitable 'tie-breaking' logic, but almost all of those have very
nasty properties.

It's not my area, and I might well have missed something, but I
suspect that it's a lost cause. Well, more than suspect. My
attempts at mathematical analysis of the abstract problem have
convinced me that it's insoluble - though my analysis is far from
watertight. The point is that it's another aspect of the shared
state consistency morass, which is what I was analysing.

This is why I regard language support for aliasing, synchronisation
and consistency as the only viable solution, using
language constraints to keep the problem in a soluble subset.

Regards,
Nick Maclaren.

Paul A. Clayton

May 29, 2012, 3:27:51 PM

On May 28, 1:51 pm, "Andy (Super) Glew" <a...@SPAM.comp-arch.net>
wrote:
[snip]

> However, note that I was trying to carefully say "inclusive snoop
> filter", rather than inclusive cache. You can have snoop filters which
> are not caches, and vice versa. A snoop filter is a lot cheaper than a
> cache - just a few bits per coherency block, rather than 64 bytes. There
> are other advantages to data-less snoop filters, which are really
> data-less caches for cache line status: e.g. one of my obsessions,
> eliminating unnecessary RFO traffic, often 20% of all data traffic, but
> implementing "ownership" of lines that are not in the cache.

Providing extra tags would allow for a lack of data inclusion
while still providing tag inclusion. Extra tags can also be used
to support compressed cache blocks (or selective use of smaller
cache blocks), enhanced replacement (recognizing when a block was
recently evicted) or prefetching (unit stride access patterns could
be checked if recently evicted, extra metadata could be associated
with dataless tag entries, tags might be referenced indirectly by
the prefetch engine [i.e., compression at the prefetch engine]), or
variable associativity ("The V-Way Cache: Demand-Based
Associativity via Global Replacement", Qureshi, Thompson, Patt,
2005). (There are probably other uses. E.g., zero blocks could
be stored as dataless tags. With indirection--NUCA or V-Way--,
more general deduplication might be practical.)

Variable memory capacity and error coverage might also allow a
slight increase in the number of tag entries available.
Predictive entries might use tags compressed by pointing to
TLB entries. (If the TLB entry eviction problem could be
handled well in most cases--and always correctly--, optional
compression of tags could be used. E.g., 32 bits could
store one physical address tag entry or two or more tags
using a TLB index and version.)

(Just babbling out some thoughts.)

Paul A. Clayton

May 29, 2012, 5:06:10 PM

On May 29, 1:41 pm, EricP <ThatWouldBeTell...@thevillage.com> wrote:
> MitchAlsup wrote:
> > On Saturday, May 26, 2012 6:36:06 PM UTC-5, Andy (Super) Glew wrote:
> >> Often you want the L2 cache to know which lines are in the L1, or in
> >> which L1 if there are multiple L1s per L2. To do this accurately the L1
> >> needs to inform the L2 when a line is no longer in use.
>
> > Piling on being so much fun..........
>
> > I have done several L1/L2 systems that had the L2 cache filter snoop
> > requests from the L1s. MOESI only uses 5 of the 8 possible states,
> > so I use 2 of these states to represent the cases where the L2 knew
> > that an L1 contained the line and snoops would be directed forward.
> > One state recorded that the L1 had not obtained write permission, while
> > another indicated that it had. So, in one case the snoop forwarding
> > was converted into an invalidate, in the other the snoop was replayed
> > at the L1.
>
> > These were inclusive cache hierarchies.
>
> > Mitch
>
> What's the difference between an L1 invalidate and an L1 snoop?

In the context, a snoop might return data (i.e., the L1 "obtained
write permission"--if it wrote data, the new data must be returned
otherwise the inclusive L2 could supply the data).

> I am thinking that an L1 invalidate might not require an ACK
> (L2 does a fire-and-forget) whereas an L1 snoop must await a reply.
> I'm also assuming there are comms queues connecting L1 and L2,
> so that all messages are async.

If latency was not important relative to interconnect bandwidth,
one could avoid some ACKs even with queuing by using implicit
ACK after a given time and requiring NACKs from the L1 if the
timing requirement could not be met. The implicit ACK could
indicate a miss or the confirmation of the L2's prediction of
the L1 status. I doubt this would be practical (complexity
and latency).

(One could also theoretically merge ACK responses to conserve
bandwidth.)

> However an async fire-and-forget would seem to open the possibility
> of a race condition if L1 tries to upgrade to write permission at the
> same time. So it looks like they both require ACKs.

I get the impression that Mitch's system did not allow
L1 to get write permission without informing L2, so I do not
think a fire-and-forget invalidate would be a problem (when
the L1 did not have write permission).

However, I may not be thinking clearly.

EricP

May 30, 2012, 12:16:54 PM

Paul A. Clayton wrote:
> On May 29, 1:41 pm, EricP <ThatWouldBeTell...@thevillage.com> wrote:
>> MitchAlsup wrote:
>>> On Saturday, May 26, 2012 6:36:06 PM UTC-5, Andy (Super) Glew wrote:
>>>> Often you want the L2 cache to know which lines are in the L1, or in
>>>> which L1 if there are multiple L1s per L2. To do this accurately the L1
>>>> needs to inform the L2 when a line is no longer in use.
>>> Piling on being so much fun..........
>>> I have done several L1/L2 systems that had the L2 cache filter snoop
>>> requests from the L1s. MOESI only uses 5 of the 8 possible states,
>>> so I use 2 of these states to represent the cases where the L2 knew
>>> that an L1 contained the line and snoops would be directed forward.
>>> One state recorded that the L1 had not obtained write permission, while
>>> another indicated that it had. So, in one case the snoop forwarding
>>> was converted into an invalidate, in the other the snoop was replayed
>>> at the L1.
>>> These were inclusive cache hierarchies.
>>> Mitch
>> What's the difference between an L1 invalidate and an L1 snoop?
>

>> However an async fire-and-forget would seem to open the possibility
>> of a race condition if L1 tries to upgrade to write permission at the
>> same time. So it looks like they both require ACKs.
>
> I receive the impression the Mitch's system did not allow
> L1 to get write permission without informing L2, so I do not
> think a fire-and-forget invalidate would be a problem (when
> the L1 did not have write permission).

Rather than say "what's the difference" I should have asked
what's the advantage of having both an L1 invalidate and a snoop,
rather than just a snoop-invalidate?

I realized after sending the prior post that the difference was
not no-ack vs ack, as I don't think the ack can be eliminated,
but little-ack vs big-ack.

If a cache line is 64 bytes, 512 bits, but the comms channel is only,
say, 128 bits wide (plus sundry control bits), then a simple
invalidate ack would fit into 1 "packet", whereas a snoop transfers
the line data plus state info, and takes at least 4 packets.
The 4 packets require more clocks to send and more free resources
in the queue, and are therefore more likely to block.
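
The packet counts can be checked with a few lines, assuming the control and state bits travel alongside the data, so every message needs at least one packet. The helper `packets` is hypothetical.

```python
import math

LINE_BITS = 512         # 64-byte cache line
CHANNEL_BITS = 128      # data width per packet, from the example above

def packets(data_bits, channel_bits=CHANNEL_BITS):
    """Packets for one message; control bits ride along, so minimum is 1."""
    return max(1, math.ceil(data_bits / channel_bits))

ack_packets = packets(0)            # an invalidate ACK carries no line data
snoop_packets = packets(LINE_BITS)  # a snoop reply carries the whole line
```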

(Just guessing out loud.)

Eric

EricP

May 30, 2012, 12:50:48 PM

EricP wrote:
>
> If a cache line is 64 bytes, 512 bits, but the comms channel is only,
> say, 128 bits wide (plus sundry control bits), then a simple
> invalidate ack would fit into 1 "packet", whereas a snoop transfers
> the line data plus state info, and takes at least 4 packets.
> The 4 packets requires more clocks to send, and more free resources
> in the queue and is therefore more likely to block.

Also the comms queue, which I assume is an SRAM with separate
write and read ports, might not be flat but asymmetrically shaped.
That is, rather than being, say, 160 bits (128 data plus 32 control)
wide by 16 entries, you could save space by separating the control
and data areas. Instead one area might have 16 entries * 32 control bits,
another area 8 data entries * 128 data bits.

A packet that only contained an ACK would use just 1 control entry,
whereas a snoop-invalidate reply takes 1 control + 4 data.
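
The storage saving can be worked out directly from the numbers above. The function names are hypothetical; the sizes are the ones in the post.

```python
def flat_bits(entries=16, data=128, control=32):
    """Storage for a flat queue where every entry holds data plus control."""
    return entries * (data + control)

def split_bits(control_entries=16, control=32, data_entries=8, data=128):
    """Storage when control and data live in separate areas."""
    return control_entries * control + data_entries * data

flat = flat_bits()              # 16 * 160
split = split_bits()            # 16 * 32 + 8 * 128
full_line_replies = 8 // 4      # replies of 4 data entries that fit at once
```

With these numbers the flat queue needs 2560 bits and the split queue 1536 bits, at the price that only two full-line replies fit in the data area at a time.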

Eric

MitchAlsup

May 30, 2012, 3:02:48 PM

On Tuesday, May 29, 2012 12:41:54 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Saturday, May 26, 2012 6:36:06 PM UTC-5, Andy (Super) Glew wrote:
> >> Often you want the L2 cache to know which lines are in the L1, or in
> >> which L1 if there are multiple L1s per L2. To do this accurately the L1
> >> needs to inform the L2 when a line is no longer in use.
> >
> > Piling on being so much fun..........
> >
> > I have done several L1/L2 systems that had the L2 cache filter snoop
> > requests from the L1s. MOESI only uses 5 of the 8 possible states,
> > so I use 2 of these states to represent the cases where the L2 knew
> > that an L1 contained the line and snoops would be directed forward.
> > One state recorded that the L1 had not obtained write permission, while
> > another indicated that it had. So, in one case the snoop forwarding
> > was converted into an invalidate, in the other the snoop was replayed
> > at the L1.
> >
> > These were inclusive cache hierarchies.
> >
> > Mitch
>
> What's the difference between an L1 invalidate and an L1 snoop?

An L1 Inval was done when the L1 data did not have write permission
(i.e., was not in the E or M state) and therefore could not have been
modified.

An L1 snoop was done when the data could have been modified but the L2
had no way of knowing whether it had been or not.

Mitch

Paul A. Clayton

May 30, 2012, 3:42:41 PM

On May 30, 12:16 pm, EricP <ThatWouldBeTell...@thevillage.com> wrote:
[snip unnecessary, weak presentation of snoop vs. invalidate]

> Rather than say "what's the difference" I should have asked
> whats the advantage of having both L1 invalidate and a snoop,
> rather than just a snoop-invalidate?

A simple invalidation request could have an advantage in
terms of the scheduling of buffer use, since an invalidation
is guaranteed not to require a data return. (It might also
facilitate more optimal scheduling of tag accesses, if it
is okay to delay the invalidation.)

The L2 has three requests: invalidate this cache line (for
a [predicted] write into a cache line that the L1 is known
not to have written), return this
cache line if dirty and update to shared status, and
return this cache line if dirty and invalidate the
L1 copy. (I think that is correct.)

> I realized after sending the prior post that the difference was
> not no-ack vs ack, as I don't think the ack can be eliminated,
> but little-ack vs big-ack.

Why would an ACK be necessary for an invalidation? As
long as the recipient L1 is guaranteed to process the
invalidation within the ordering constraints of the
Architecture (or provide a NACK sufficiently quickly).
(I am guessing that using an ACK is simpler and more
likely to be correct.)

> If a cache line is 64 bytes, 512 bits, but the comms channel is only,
> say, 128 bits wide (plus sundry control bits), then a simple
> invalidate ack would fit into 1 "packet", whereas a snoop transfers
> the line data plus state info, and takes at least 4 packets.
> The 4 packets requires more clocks to send, and more free resources
> in the queue and is therefore more likely to block.

A snoop (in the presented, inclusive design) would not
necessarily return data. The L2 is using the snoop to
ask the L1 if it actually modified the cache line (i.e.,
used its write permission) and if so to return the data.
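
A sketch of the L1 side of those three requests, assuming a single line tracked by a valid bit and a dirty bit. The request names and the function `l1_handle` are hypothetical.

```python
def l1_handle(request, valid, dirty):
    """Return (data_returned, valid_after, dirty_after) for one L1 line."""
    if not valid:
        return False, False, False
    if request == "invalidate":
        # Only sent when the L1 is known not to have written the line.
        return False, False, False
    if request == "snoop_share":
        # Return data only if modified; keep a clean shared copy.
        return dirty, True, False
    if request == "snoop_invalidate":
        # Return data only if modified; drop the L1 copy.
        return dirty, False, False
    raise ValueError(request)
```

In this sketch a snoop that finds a clean line returns no data, so only a snoop that hits a dirty line pays for the full line transfer.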

If read-for-ownership can be avoided (block allocate/wh64,
block zero, or even write coalescing), then even a dirty
cache line can be simply invalidated.

EricP

May 30, 2012, 3:54:00 PM

I understood that, thanks.
I was trying to understand the advantage of introducing
such a distinction in the first place.

Without considering the size of the data being moved,
they look to have the same cost, handshake wise.
But then I remember... oh yeah, a cache line is 512 bits.
Not moving that saves power, even if you can do it in parallel
in a single clock. If for some reason you don't want to run, say,
1200 parallel wires for the comms queues between the caches
(600 L1->L2, 600 L2->L1) then you have to split up the line
into multiple packets and it takes more clocks.

Anyway... that's what I was pondering about.

Eric

Paul A. Clayton

May 30, 2012, 4:11:36 PM

The separation of control and data information does seem
like it could have some advantages. (The data buffer could
potentially be shared by outgoing and incoming data even if
the outgoing and incoming commands used separate buffers.
Also, since data transfer takes time, the command will
likely be active for noticeably longer than the data will
be buffered, so late allocation of data buffer entries
might be practical. Zeroed cache lines could be compressed
to not use the data buffer.)

I do not know how much control overhead there is. Responses
could use a transaction number rather than a full address (I
think).

I suspect it would be very tempting to fill the entire channel
width, possibly by allowing two short messages to share the
width for one "beat". One could imagine an ACK from an
incoming request piggybacking on an outgoing request or a
larger response, a little like TCP.

MitchAlsup

May 30, 2012, 6:20:56 PM

The already existing L2 tags provide the only required store
to use the L2 as a snoop filter for the L1s.

> Without considering the size of the data being moved,
> they look to have the same cost, handshake wise.
> But then I remember... oh yeah, a cache line is 512 bits.

Consider the situation where the L2 has one size cache line (512)
while the L1 has another cache line size (say 128).

Mitch

Joe keane

May 31, 2012, 4:54:18 PM

In article <f9bd2881-a7fb-42a9...@googlegroups.com>,

MitchAlsup <Mitch...@aol.com> wrote:
>These were inclusive cache hierarchies.
>
>Mitch

What if the L2 line size is bigger than the L1 line size?

MitchAlsup

May 31, 2012, 5:50:30 PM

The L2 line size IS bigger than the L1 line size!

So reask your question.

Mitch

Joe keane

May 31, 2012, 7:02:59 PM

In article <6cc646d6-4758-4ad4...@googlegroups.com>,
MitchAlsup <Mitch...@aol.com> wrote:
>So reask your question.

Are there any tricks that seem like a good idea when the line sizes are
the same, but fall apart if this assumption is not true?

Or is any method that one would think is practical to implement
sufficiently general to handle this case?

[for specificity, suppose every core has its own L2 but cores on the
same chip share the L3]

MitchAlsup

Jun 1, 2012, 11:30:14 AM

On Thursday, May 31, 2012 6:02:59 PM UTC-5, Joe keane wrote:
> In article <6cc646d6-4758-4ad4...@googlegroups.com>,
> MitchAlsup <Mitch...@aol.com> wrote:
> >So reask your question.
>
> Are there any tricks that seem like a good idea when the line sizes are
> the same, but fall apart if this assumption is not true?

Well, we did the L1 different than L2 because we could transfer an
L1 cache line both ways per clock to the L2 (i.e. no bursting).
This made the number of cycles spent waiting for L2 response to
an L1 miss go down because one did not wait for the current miss
to burst the rest of his data up before sending the second miss data.
It also alleviates the need for critical word first on this interface.
Overall this was 5%-ish faster than the bursting version on L2
intense codes.

So, it's a microarchitectural thing, not a cache line size thing.

> Or is any method that one would think is practical to implement
> sufficiently general to handle this case?

I think you deal with this on an implementation by implementation
basis.

> [for specificity, suppose every core has its own L2 but cores on the
> same chip share the L3]

Overall you trade off responsiveness for bandwidth. If the L2 is
big enough so that the L3 is less than 50% saturated, it probably
doesn't matter a lot whether you burst the whole line or piecemeal
the parts in and out.

Mitch