
Cache sizes on Ice Lake


Anton Ertl

Oct 24, 2018, 5:00:11 AM
According to <https://browser.geekbench.com/v4/cpu/10445533>, Ice Lake
has 48KB D-cache, but still 32KB I-cache. Given that the D-Cache size
in Intel mainline CPUs has been 32KB since Yonah in 2006, and AMD,
too, has settled on 32KB since Excavator in 2015 (after using 64KB in
K7-K10 (1999-2011) and 16KB in Bulldozer-Steamroller (2011-2015)),
this is an interesting development. Will the latency increase?

It is also interesting that the I-cache stays at 32KB. Given that it
is actually the second level after the Decoded Stream Buffer (1.5k
uOps in Skylake, no data for Ice Lake yet), the latency of the I-cache
should not be that important, and the hit rate more relevant, so I
would expect it to be larger (Ryzen has 64KB I-cache on top of a 2K
entry uop cache).

The L2 cache is 512KB (midway between Skylake and Skylake-X); the L3
cache on the tested sample is 2MB/core, like in Skylake (and mainline
CPUs since Nehalem; some of that is often disabled), unlike Skylake-X.
The L3 on Skylake-X is reported to be a non-inclusive victim cache
(while on Skylake it is an inclusive cache). What will the Ice Lake
be? Given the sizes, I expect it to be like Skylake rather than
Skylake-X, but OTOH, the Ryzen L3 with the same sizes as the Ice Lake
sample is a mostly exclusive victim cache.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Terje Mathisen

Oct 24, 2018, 5:42:38 AM
Anton Ertl wrote:
> According to <https://browser.geekbench.com/v4/cpu/10445533>, Ice Lake
> has 48KB D-cache, but still 32KB I-cache. Given that the D-Cache size
> in Intel mainline CPUs has been 32KB since Yonah in 2006, and AMD,
> too, has settled on 32KB since Excavator in 2015 (after using 64KB in
> K7-K10 (1999-2011) and 16KB in Bulldozer-Steamroller (2011-2015)),
> this is an interesting development. Will the latency increase?

Yes, but possibly not enough to cause any additional cycles to be needed?
>
> It is also interesting that the I-cache stays at 32KB. Given that it
> is actually the second level after the Decoded Stream Buffer (1.5k
> uOps in Skylake, no data for Ice Lake yet), the latency of the I-cache
> should not be that important, and the hit rate more relevant, so I
> would expect it to be larger (Ryzen has 64KB I-cache on top of a 2K
> entry uop cache).

Aren't the working set sizes for code split in two?

Either you can do very nicely with a smallish cache, or you need many
MBs (i.e. databases)?

>
> The L2 cache is 512KB (midway between Skylake and Skylake-X); the L3
> cache on the tested sample is 2MB/core, like in Skylake (and mainline
> CPUs since Nehalem; some of that is often disabled), unlike Skylake-X.
> The L3 on Skylake-X is reported to be a non-inclusive victim cache
> (while on Skylake it is an inclusive cache). What will the Ice Lake
> be? Given the sizes, I expect it to be like Skylake rather than
> Skylake-X, but OTOH, the Ryzen L3 with the same sizes as the Ice Lake
> sample is a mostly exclusive victim cache.

I would worry that switching from their traditional inclusive caches to
an exclusive victim cache could cause some bugs/issues?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Niels Jørgen Kruse

Oct 24, 2018, 8:04:09 AM
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:

> According to <https://browser.geekbench.com/v4/cpu/10445533>, Ice Lake
> has 48KB D-cache, but still 32KB I-cache. Given that the D-Cache size
> in Intel mainline CPUs has been 32KB since Yonah in 2006, and AMD,
> too, has settled on 32KB since Excavator in 2015 (after using 64KB in
> K7-K10 (1999-2011) and 16KB in Bulldozer-Steamroller (2011-2015)),
> this is an interesting development. Will the latency increase?

Or Geekbench could just be wrong? Geekbench reports 32I+32D on Apple's
A11, but AnandTech found 64I+64D.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

Bruce Hoult

Oct 24, 2018, 8:14:36 AM
That result is on Linux. I'd expect Geekbench to be as accurate as lscpu there -- which is pretty detailed, e.g.:

https://www.servethehome.com/wp-content/uploads/2018/10/AMD-Ryzen-Threadripper-2990WX-lscpu.jpg
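
FWIW the numbers lscpu prints come straight from sysfs, so it is easy to
walk the same data yourself. A minimal sketch in C (standard Linux sysfs
cache paths, error handling kept to the bare minimum):

#include <stdio.h>
#include <string.h>

/* Print cpu0's cache hierarchy from sysfs -- the same source lscpu
   summarizes. */
static int rd(int idx, const char *attr, char *buf, int n)
{
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu0/cache/index%d/%s", idx, attr);
    FILE *f = fopen(path, "r");
    if (!f) return 0;
    int ok = fgets(buf, n, f) != NULL;
    fclose(f);
    if (ok) buf[strcspn(buf, "\n")] = 0;   /* strip trailing newline */
    return ok;
}

int main(void)
{
    char level[8], type[16], size[16], ways[16];
    for (int i = 0; rd(i, "level", level, sizeof level); i++) {
        rd(i, "type", type, sizeof type);
        rd(i, "size", size, sizeof size);
        rd(i, "ways_of_associativity", ways, sizeof ways);
        printf("L%s %-12s %6s, %s-way\n", level, type, size, ways);
    }
    return 0;
}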

MitchAlsup

Oct 24, 2018, 8:56:31 PM
On Wednesday, October 24, 2018 at 4:00:11 AM UTC-5, Anton Ertl wrote:

I have always found the tradeoff between L1 cache size and L1 access latency
to be a delicate balance.

A 1 cycle cache is necessarily small.
A 2 cycle cache can have 1/2 cycle of wire delay going in and coming out
with the same sized SRAM instances.

This is what led to the Alpha 21064 coming out with an 8K cache, getting
massaged into the 21164 with a 16K cache, and back and forth. As the
frequency of the data path got faster, the caches were squeezed in size.
After the SRAM designers got their analog act together, the SRAM sizes
could be increased.

With the 21264 the design added a whole clock cycle and cache sizes stabilized
at 64KB.

If you can't figure out how to absorb the added cache latency cycle, you do
everything in your power to accept the smaller cache size. If you can absorb
the latency, you use the largest cache you can afford (at that latency.)

Many benchmarks put up with rather small cache sizes, and a few simply beat
you up if you went too small (FPPPP, or was that TOMCATV?). Database stuff
is quite unforgiving in I or D cache size (actually the whole memory
hierarchy.)
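
The size/latency knee Mitch describes is easy to see for yourself with
the classic dependent-load pointer chase. A rough sketch, nothing tuned:
a random cyclic permutation defeats the prefetchers, and since each
load's address comes from the previous load, ns/iteration approximates
the load-to-use latency at that footprint:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

static uint64_t s = 88172645463325252ull;     /* xorshift64 */
static uint64_t rng(void) { s ^= s << 13; s ^= s >> 7; s ^= s << 17; return s; }

int main(void)
{
    for (size_t kb = 4; kb <= 8192; kb *= 2) {
        size_t n = kb * 1024 / sizeof(void *);
        void **buf = malloc(n * sizeof(void *));
        size_t *perm = malloc(n * sizeof(size_t));
        for (size_t i = 0; i < n; i++) perm[i] = i;
        for (size_t i = n - 1; i > 0; i--) {  /* Fisher-Yates shuffle */
            size_t j = rng() % (i + 1);
            size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        for (size_t i = 0; i < n; i++)        /* link one big cycle */
            buf[perm[i]] = &buf[perm[(i + 1) % n]];
        void **p = &buf[perm[0]];
        long iters = 10 * 1000 * 1000;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++) p = (void **)*p;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        printf("%6zu KB: %5.2f ns/load%s\n", kb, ns / iters,
               p ? "" : "!");                 /* keep the chase live */
        free(buf); free(perm);
    }
    return 0;
}

The L1/L2/L3 plateaus usually read straight off the output.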

Bruce Hoult

Oct 24, 2018, 9:37:15 PM
In the embedded world people pay a lot of attention to the "CoreMark" benchmark that is promoted by ARM.

It turns out that if you use an ISA with mixed 16/32 bit instruction lengths (e.g. Thumb2 or RISC-V with the C extension) then CoreMark runs very well with 16 KB of icache or ITIM. If you use base RISC-V with only 32 bit opcodes then it doesn't, and I expect the same will be true for ARM32 or MIPS or others.

On the HiFive1 dev board it makes a difference of 2x although the penalty for an icache miss is admittedly unusually severe, having to go to SPI flash for the cache reload.

Dhrystone is smaller and fits either way.

Stephen Fuld

Oct 25, 2018, 2:18:29 AM
On 10/24/2018 5:56 PM, MitchAlsup wrote:
> On Wednesday, October 24, 2018 at 4:00:11 AM UTC-5, Anton Ertl wrote:
>
> I have always found the tradeoff between L1 cache size and L1 access latency
> to be a delicate balance.
>
> A 1 cycle cache is necessarily small.
> A 2 cycle cache can have 1/2 cycle of wire delay going in and coming out
> with the same sized SRAM instances.
>
> This is what led to the Alpha 21064 coming out with an 8K cache, getting
> massaged into the 21164 with a 16K cache, and back and forth. As the
> frequency of the data path got faster, the caches were squeezed in size.

But making the data paths faster was, in those days, equivalent to
making the transistors smaller. So you could fit the same sized cache
in a smaller physical footprint. I must be missing something, because,
as Anton has stated, for the Intel CPUs, the (L1 and L2) cache size
stayed the same over several generations of smaller feature size with
minimal increase in frequency. Why was that?




--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Terje Mathisen

Oct 25, 2018, 3:18:42 AM
Wires got slower as they got thinner?

Anton Ertl

Oct 25, 2018, 4:14:53 AM
Terje Mathisen <terje.m...@tmsw.no> writes:
>Anton Ertl wrote:
>> According to <https://browser.geekbench.com/v4/cpu/10445533>, Ice Lake
>> has 48KB D-cache, but still 32KB I-cache. Given that the D-Cache size
>> in Intel mainline CPUs has been 32KB since Yonah in 2006, and AMD,
>> too, has settled on 32KB since Excavator in 2015 (after using 64KB in
>> K7-K10 (1999-2011) and 16KB in Bulldozer-Steamroller (2011-2015)),
>> this is an interesting development. Will the latency increase?
>
>Yes, but possibly not enough to cause any additional cycles to be needed?

Or, alternatively, they have to add another cycle (or make the D-cache
smaller than 32KB), but they then manage to access a 48KB D-cache in
that time. In the pretty good OoO CPUs of our times, the better hit
rate may help more than the longer latency hurts.
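
A back-of-the-envelope AMAT check, with illustrative numbers rather than
measurements: assume a 12-cycle L2. A 32KB L1 at 4 cycles with a 6% miss
rate gives 4 + 0.06*12 = 4.72 cycles per access; a 48KB L1 at 5 cycles
with a 4.5% miss rate gives 5 + 0.045*12 = 5.54. On this naive in-order
model the extra hit cycle loses, because every access pays it; the bet
only pays off if the OoO engine hides most of the L1 hit latency while
the misses stay expensive.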

The latency of the P6-derived cores has gone up from 3 cycles
(P6-Penryn (Core 2)) to 4 cycles (Nehalem-Skylake) in 2008, while
keeping the D-cache size at 32KB (it had grown from P6 to Penryn). If
the additional cycle allowed a larger D-cache, why has Intel not gone
there before? Lack of competition?

>> It is also interesting that the I-cache stays at 32KB. Given that it
>> is actually the second level after the Decoded Stream Buffer (1.5k
>> uOps in Skylake, no data for Ice Lake yet), the latency of the I-cache
>> should not be that important, and the hit rate more relevant, so I
>> would expect it to be larger (Ryzen has 64KB I-cache on top of a 2K
>> entry uop cache).
>
>Aren't the working set sizes for code split in two?
>
>Either you can do very nicely with a smallish cache, or you need many
>MBs (i.e. databases)?

AMD has stayed at 64KB I-cache since the K7 in 1999, so apparently
they perceive a value in having more than 32KB I-cache.

>> The L2 cache is 512KB (midway between Skylake and Skylake-X); the L3
>> cache on the tested sample is 2MB/core, like in Skylake (and mainline
>> CPUs since Nehalem; some of that is often disabled), unlike Skylake-X.
>> The L3 on Skylake-X is reported to be a non-inclusive victim cache
>> (while on Skylake it is an inclusive cache). What will the Ice Lake
>> be? Given the sizes, I expect it to be like Skylake rather than
>> Skylake-X, but OTOH, the Ryzen L3 with the same sizes as the Ice Lake
>> sample is a mostly exclusive victim cache.
>
>I would worry that switching from their traditional inclusive caches to
>an exclusive victim cache could cause some bugs/issues?

They have made that switch for the server CPUs already, so they are
obviously confident that they have this under control.

Stephen Fuld

Oct 25, 2018, 11:55:12 AM
So I think you are positing that the benefit of smaller feature sizes
(less distance) is approximately balanced out by the cost of wires
getting slower as they got smaller/thinner. That could be.

But I thought the limit on faster clock rate was power/heat dissipation.
They shouldn't be a problem in caches.

So are the limits heat within the CPU proper but wire speed to/in the
caches?

Quadibloc

Oct 25, 2018, 12:06:34 PM
On Thursday, October 25, 2018 at 12:18:29 AM UTC-6, Stephen Fuld wrote:
> I must be missing something, because,
> as Anton has stated, for the Intel CPUs, the (L1 and L2) cache size
> stayed the same over several generations of smaller feature size with
> minimal increase in frequency. Why was that?

Dennard Scaling, unlike Moore's Law, is dead.

So smaller feature size led to minimal increase in frequency in recent years,
and since there was minimal increase in frequency, there was no opportunity to
increase the size of the cache.

The fact that frequency was still increasing with smaller feature size,
just at a much slower rate than before, was obscured, though, by the
abandonment of high-frequency designs like the Pentium 4 and Bulldozer.
This is why some noticeable increases in frequency, and in cache size,
have finally become possible.

John Savard

Stephen Fuld

Oct 25, 2018, 12:19:42 PM
On 10/25/2018 9:06 AM, Quadibloc wrote:
> On Thursday, October 25, 2018 at 12:18:29 AM UTC-6, Stephen Fuld wrote:
>> I must be missing something, because,
>> as Anton has stated, for the Intel CPUs, the (L1 and L2) cache size
>> stayed the same over several generations of smaller feature size with
>> minimal increase in frequency. Why was that?
>
> Dennard Scaling, unlike Moore's Law, is dead.
>
> So smaller feature size led to minimal increase in frequency in recent years,

Agreed.

> and since there was minimal increase in frequency, there was no opportunity to
> increase the size of the cache.


I don't see how that follows. All other things being equal, lower
frequency should allow larger caches. And given that they couldn't
increase frequency, there is incentive to increase cache size to improve
performance.

Chris M. Thomasson

Oct 25, 2018, 4:31:51 PM
Is the L1 cache line 64 bytes and the L2 line 128 bytes? A while back, I
seem to remember an L2 128-byte cache line being split into two 64-byte
areas wrt hyperthreading, along with an aliasing problem. One had to
offset the stacks of threads in a special way to get around false sharing.
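
For reference, the workaround looked roughly like the sketch below. The
128-byte stride is hypothetical -- the right offset depended on the
aliasing distance of the particular part. Each thread burns a
thread-dependent chunk of stack in a trampoline before calling the real
work function, so the hot stack frames of sibling hyperthreads land at
different alias offsets:

#include <alloca.h>
#include <pthread.h>
#include <stddef.h>

struct task { int id; void (*work)(void *); void *arg; };

/* Trampoline: offset this thread's stack by id * 128 bytes before
   entering the real work function. */
static void *trampoline(void *p)
{
    struct task *t = p;
    volatile char *pad = alloca((size_t)t->id * 128 + 1);
    pad[0] = 0;                   /* keep the allocation alive */
    t->work(t->arg);
    return NULL;
}

/* usage: pthread_create(&tid, NULL, trampoline, &tasks[i]); */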

Anton Ertl

Oct 26, 2018, 8:14:47 AM
Stephen Fuld <SF...@alumni.cmu.edu.invalid> writes:
>So I think you are positing that the benefit of smaller feature sizes
>(less distance) is approximately balanced out by the cost of wires
>getting slower as they got smaller/thinner. That could be.

Yes, the explanation I heard for the end of the clock frequency
increases was, on one hand, that wire time does not decrease as
features get smaller: The length is shorter, but capacitance stays
about the same and keeps the wire time roughly the same. Once
transistor time was smaller than wire time, frequency scaling slowed
down.
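
The first-order version of that argument: treat the wire as a
distributed RC line. Capacitance per unit length is roughly constant
across process nodes, so C = c*L just tracks length, while
R = rho*L/(W*H). Shrink everything by a factor s: L, W and H all drop
by s, so R grows by s while C shrinks by s, and the product RC -- the
wire delay -- stays put. Gate delay keeps falling, so the wires come
to dominate.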

>But I thought the limit on faster clock rate was power/heat dissipation.

That was the other issue. They were prepared to get higher clocks
with deeper pipelines (Intel Tejas, and Mitch Alsup's K9), but that
means more frequent switching of more transistors, i.e., high power
consumption and, (judging from GPUs) more importantly, more power
density. Both CPUs were cancelled.

> They shouldn't be a problem in caches.

Not sure about that. At one point someone from AMD explained that
they could not increase the cache of (IIRC) K8 (or was it a late K7?)
because it would consume too much power.

>So are the limits heat within the CPU proper but wire speed to/in the
>caches?

These days, there are CPUs where the uncore power exceeds the core
power
<https://www.anandtech.com/show/13124/the-amd-threadripper-2990wx-and-2950x-review/4>,
but power density is still bigger in the cores (only 8*7mm^2=56mm^2 of
the 212mm^2 Zeppelin die are in cores
<https://en.wikichip.org/wiki/amd/microarchitectures/zen>).

In any case, there are different aspects of speed relevant to this
discussion: latency and clock rate. For latency (in ns), the
transistor and wire delays play a big role (and, the larger the cache,
the more wire delay), while the clock rate of CPU cores has been
limited by heat.

Stephen Fuld

Oct 26, 2018, 11:56:22 AM
On 10/26/2018 4:24 AM, Anton Ertl wrote:
> Stephen Fuld <SF...@alumni.cmu.edu.invalid> writes:
>> So I think you are positing that the benefit of smaller feature sizes
>> (less distance) is approximately balanced out by the cost of wires
>> getting slower as they got smaller/thinner. That could be.
>
> Yes, the explanation I heard for the end of the clock frequency
> increases was, on one hand, that wire time does not decrease as
> features get smaller: The length is shorter, but capacitance stays
> about the same and keeps the wire time roughly the same. Once
> transistor time was smaller than wire time, frequency scaling slowed
> down.
>
>> But I thought the limit on faster clock rate was power/heat dissipation.
>
> That was the other issue. They were prepared to get higher clocks
> with deeper pipelines (Intel Tejas, and Mitch Alsup's K9), but that
> means more frequent switching of more transistors, i.e., high power
> consumption and, (judging from GPUs) more importantly, more power
> density. Both CPUs were cancelled.
>
>> They shouldn't be a problem in caches.
>
> Not sure about that. At one point someone from AMD explained that
> they could not increase the cache of (IIRC) K8 (or was it a late K7?)
> because it would consume too much power.


OK, but I don't understand why. If the power use is roughly
proportional to the number of transistors switching, increasing the
cache size doesn't increase the number of transistors switching (except
in the sense that the higher hit rate allows less stall time for the
system). It is still one cache line for a hit, independent of the
number of "untouched" lines in a particular access. Again, I am far from
an expert in this area, so I may be way off base.



>
>> So are the limits heat within the CPU proper but wire speed to/in the
>> caches?
>
> These days, there are CPUs where the uncore power exceeds the core
> power
> <https://www.anandtech.com/show/13124/the-amd-threadripper-2990wx-and-2950x-review/4>,
> but power density is still bigger in the cores (only 8*7mm^2=56mm^2 of
> the 212mm^2 Zeppelin die are in cores
> <https://en.wikichip.org/wiki/amd/microarchitectures/zen>).
>
> In any case, there are different aspects of speed relevant to this
> discussion: latency and clock rate. For latency (in ns), the
> transistor and wire delays play a big role (and, the larger the cache,
> the more wire delay), while the clock rate of CPU cores has been
> limited by heat.

Yes, but my original question assumed that, given the same latency, the wire
delay of a larger cache should be mitigated by the smaller size of the
individual transistors, i.e. more transistors within a given physical
space. That is what I don't understand.

Quadibloc

Oct 26, 2018, 12:02:36 PM
On Friday, October 26, 2018 at 9:56:22 AM UTC-6, Stephen Fuld wrote:

> Yes, but my original question was that given the same latency, the wire
> delay of a larger cache should be mitigated by the smaller size of the
> individual transistors. i.e. more transistors within a given physical
> space. That is what I don't understand.

The wires in the cache are like the wires in the CPU.

Smaller feature sizes have made the CPU get smaller, yet it hasn't gotten
faster.

So the wires got shorter: but because everything got smaller, the wires got
*thinner*. Therefore, signals go down the wires *more slowly* than before, not
at the same speed. One has to worry about resistance and inductance, not just
capacitance, basically.

John Savard

Stephen Fuld

Oct 26, 2018, 12:45:53 PM
On 10/26/2018 9:02 AM, Quadibloc wrote:
> On Friday, October 26, 2018 at 9:56:22 AM UTC-6, Stephen Fuld wrote:
>
>> Yes, but my original question was that given the same latency, the wire
>> delay of a larger cache should be mitigated by the smaller size of the
>> individual transistors. i.e. more transistors within a given physical
>> space. That is what I don't understand.
>
> The wires in the cache are like the wires in the CPU.


Yes

> Smaller feature sizes have made the CPU get smaller, yet it hasn't gotten
> faster.

Yes. My belief at least was that this was primarily due to power/heat
limitations.


> So the wires got shorter: but because everything got smaller, the wires got
> *thinner*. Therefore, signals go down the wires *more slowly* than before, not
> at the same speed. One has to worry about resistance and inductance, not just
> capacitance, basically.


OK, I guess. I think I have to do some research on how the speed of
signal propagation is affected by these factors. I thought the speed
was only determined by the material, and perhaps the geometry (amount of
surface area). Perhaps I was wrong.

MitchAlsup

Oct 26, 2018, 1:17:16 PM
When one increases the size of the cache but not the size of the page
one ends up having more sets of associativity in the cache to avoid an
aliasing problem. Thus, as the 1st level cache gets bigger, it gets
more power intensive.
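
To put numbers on that constraint (assuming 4KB pages and 64-byte
lines): a virtually-indexed, physically-tagged L1 must keep its
index+offset bits within the page offset, i.e. size <= page_size * ways.
A 32KB L1 at 8 ways is 4KB per way, exactly one page; growing to 48KB
with the same page size forces at least 48/4 = 12 ways, every one of
which is read in parallel on each access.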

In the case of Athlon/Opteron, the 64KB caches were 4-way banked and
4-way setted in order to support 2 accesses per cycle, and the TLB
was organized along with the cache tags so that both the PTE data and
cache tag data arrived at an analog XOR gate with the sense amps reading
if both agreed with each other.

Now, making caches bigger but reading only a single set after determining
which set to read after tag comparisons is way lower in power--but way
longer in latency. In K9 we even ran the columns of the L2 in analog,
using tag hit wires individually as SRAM enables. A column took 3 cycles:
1 was wire delay in, 1 was access, and 1 was wire delay out. There
were 16 SRAM instances on each column; and this was after the 2-cycle
cache tag lookup. Wire delay in and out of the L2 was 5-ish cycles. So,
in the latency of an L2 access (20 cycles) one could service 4 L2 requests
(of 5 cycles each).

MitchAlsup

Oct 26, 2018, 1:24:14 PM
On Friday, October 26, 2018 at 10:56:22 AM UTC-5, Stephen Fuld wrote:
At 120nm (before wires got really slow) one could traverse 1mm of wire in
1ns from sending end to receiving end.
At 12nm (after wires got really slow) one could only traverse 0.03mm of wire
in 1ns.

So the wires are 3X slower even AFTER correcting for the gain in length (10x).
If you needed to go 1mm, the wire in 12nm will take 30X as long as the wire
in 120nm. At the same time, the SRAM went from 2ns cycle time to 0.5ns
cycle time. Wires got slower faster than transistor area got smaller.
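
Checking the arithmetic on those numbers: 1mm/ns at 120nm versus
0.03mm/ns at 12nm is ~33X more delay per mm. A structure that shrank
with the process is 10X shorter, so its wire delay still grew by ~3.3X
(the 3X above); a wire that must still span a physical 1mm eats the full
~30X. With SRAM cycle time improving 4X (2ns to 0.5ns) over the same
span, the long wires fell behind the logic by roughly two orders of
magnitude.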

And this is why wires at the top of the wire stack are invariably larger in
cross sectional area. These layers are for wires that travel significant
distances (They might be used for wires into and out of the Ln cache; but
they will NOT be used for more local SRAM to flip-flop wiring.)

Stephen Fuld

Oct 26, 2018, 2:10:23 PM
Thank you, Mitch. This was exactly the information I was missing. As
usual, you provided an informative and authoritative answer.


> And this is why wires at the top of the wire stack are invariably larger in
> cross sectional area. These layers are for wires that travel significant
> distances (They might be used for wires into and out of the Ln cache; but
> they will NOT be used for more local SRAM to flip-flop wiring.)
>


Paul A. Clayton

Oct 27, 2018, 1:14:37 AM
On Friday, October 26, 2018 at 1:17:16 PM UTC-4, MitchAlsup wrote:
[snip]
> When one increases the size of the cache but not the size of the page
> one ends up having more sets of associativity in the cache to avoid an
> aliasing problem. Thus, as the 1st level cache gets bigger, it gets
> more power intensive.
>
> In the case of Athlon/Opteron, the 64KB caches were 4-way banked and
> 4-way setted in order to support 2 accesses per cycle, and the TLB
> was organized along with the cache tags so that both the PTE data and
> cache tag data arrived at an analog XOR gate with the sense amps reading
> if both agreed with each other.

ISTR that Athlon's 64 KiB L1 caches were only two-way
associative (and used a duplicate tag set to probe aliasable
ways on a cache miss). Probing alternate ways on a cache miss
(with optional filtering and other throughput/energy
optimizations) is one way to avoid the page-size-times-
associativity limit on cache size. A tag-inclusive L2 cache
(which can be helpful in filtering coherence traffic)
indicating the set (or even specific index) would also avoid
the associativity constraint.

Way prediction can reduce the access energy cost of higher
associativity. Predicting the least significant bits of the
physical address seems a closely related technique for
avoiding the associativity constraint (one could view way
group allocation as being determined by these bits, so the
prediction is a way group prediction).
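
A toy model of the way-prediction idea, purely illustrative (real
predictors key on more than a per-set last-hit way): probe the single
predicted way first and fall back to a full search only on a mispredict,
so the common case reads one tag/data way instead of all of them:

#include <stdbool.h>
#include <stdint.h>

#define SETS 64
#define WAYS 12
#define LINE 64

struct line { uint64_t tag; bool valid; };

static struct line cache[SETS][WAYS];
static uint8_t pred[SETS];           /* last-hit way per set */

/* Toy lookup: the fast path reads one way; a mispredict probes the
   rest and retrains the predictor.  Returns the hit way or -1. */
static int lookup(uint64_t addr)
{
    unsigned set = (addr / LINE) % SETS;
    uint64_t tag = addr / (LINE * SETS);
    unsigned w = pred[set];
    if (cache[set][w].valid && cache[set][w].tag == tag)
        return (int)w;                      /* predicted-way hit */
    for (unsigned i = 0; i < WAYS; i++) {   /* slow path: full probe */
        if (i != w && cache[set][i].valid && cache[set][i].tag == tag) {
            pred[set] = (uint8_t)i;         /* retrain on the new way */
            return (int)i;
        }
    }
    return -1;                              /* miss */
}

int main(void)
{
    cache[3][7] = (struct line){ .tag = 42, .valid = true };
    uint64_t addr = (42ull * SETS + 3) * LINE;
    int first = lookup(addr);   /* mispredicts, full probe, retrains */
    int second = lookup(addr);  /* now hits via the predicted way */
    return (first == 7 && second == 7) ? 0 : 1;
}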

Other techniques for handling aliases (aside from just not
allowing them) have been proposed.

What I find disappointing is that neither clustering nor
sub-block NUCA have been adopted to reduce latency. While
clustering has the obvious issues of cross-cluster
communication (where the load-to-use-for-load latency may
include the latency of accessing a different cluster's cache)
and layout implications, such would sometimes have lower
latency. Sub-block NUCA (whether of the least significant bits
of every word sufficient to provide indexing and perhaps even
help in way prediction (effectively width pipelining cache
access) or of whole selected, possibly dynamically, words)
would allow lower load-to-use latency.

Qualcomm's Falkor *might* be considered to have a block-based
NUCA L1 instruction cache if one considers the L0 instruction
cache as the fast portion of L1. (L0 and L1 are described as
exclusive, but the allocation and transfer policy does not
appear to be described publicly. If blocks are sometimes
allocated to L1 (not always promoted to L0) and blocks evicted
from L0 may bypass L1, then such would be rather NUCA-ish.)

(Instruction caches offer interesting width pipelining
options. For instruction address relative and absolute
immediate branches, a BTB could be viewed as a (typically
redundant storage) form of width pipelining. With 16-byte
aligned fetches, a cache might be designed with the first
8-byte chunk in a closer portion and the second in a more
distant portion, allowing some of the instruction data to
have lower latency. Predecoding could bias critical
instruction information placement into the faster half, but
even without biases such NUCA could be similar to a
branch target instruction cache. A trace cache with two
fetch width long traces (like the Pentium 4) might be able
to have a two-cycle redirect latency with a nearly 3-cycle-
sized trace cache by fetching the first chunk from a half-
capacity fast region.)

(Width pipelining methods could also be applied to memory
accesses, allowing the first beat of a DRAM access to be
used early while having a wide interface and moderate extra
tracking complexity. Laying out data in memory according to
expected criticality could reduce latency (at the cost of a
little extra metadata).)

Anton Ertl

Oct 27, 2018, 1:05:16 PM
MitchAlsup <Mitch...@aol.com> writes:
>On Friday, October 26, 2018 at 10:56:22 AM UTC-5, Stephen Fuld wrote:
>> On 10/26/2018 4:24 AM, Anton Ertl wrote:
>> > Not sure about that. At one point someone from AMD explained that
>> > they could not increase the cache of (IIRC) K8 (or was it a late K7?)
>> > because it would consume too much power.
>>
>>
>> OK, but I don't understand why. If the power use is roughly
>> proportional to the number of transistors switching, increasing the
>> cache size doesn't increase the number of transistors switching (except
>> in the sense that the higher hit rate allows less stall time for the
>> system). It is still one cache line for a hit, independent of the
>> number of "untouched" lines in a particular access. Again, I am far from
>> an expert in this area, so I may be way off base.
>
>When one increases the size of the cache but not the size of the page
>one ends up having more sets of associativity in the cache to avoid an
>aliasing problem. Thus, as the 1st level cache gets bigger, it gets
>more power intensive.

The cache under discussion was the L2 cache (at least that's how I
remember it), which is physically indexed, so page size does not play
a role. I did not understand why the cache size would cause such a
power draw, either. I guess it was something specific to their design
that was not easy to change.

timca...@aol.com

Oct 27, 2018, 9:41:24 PM
On Friday, October 26, 2018 at 11:56:22 AM UTC-4, Stephen Fuld wrote:
> OK, but I don't understand why. If the power use is roughly
> proportional to the number of transistors switching, increasing the
> cache size doesn't increase the number of transistors switching (except
> in the sense that the higher hit rate allows less stall time for the
> system). It is still one cache line for a hit, independent of the
> number of "untouched" lines in a particular access. Again, I am far from
> an expert in this area, so I may be way off base.
>
Don't forget about leakage current, which also gets worse the smaller the feature size (although I believe FinFETs helped with that).

- Tim