Stefan Monnier <mon...@iro.umontreal.ca> writes:
>>>I just bumped into the Ampere Altra's specifications and was struck: for
>>>a processor with 250W TDP isn't a 32MB last-level cache small?
>> That's an odd relation. When I read that, I put it in relation to the
>> number of cores: 32MB L3 for 80 cores on the Altra Q80-33, and 16MB for 128
>> cores on the Altra Max M128-30.
>
>I find TDP to be a better approximation of "total work done per second"
>than the number of cores or the CPU's frequency. I'm not claiming it's
>perfect, by a long shot, but I think it's meaningful.
I don't. E.g., consider that an 8-core Core i9-11900K can consume
more than 250W, similar to a 64-core EPYC. For a typical server
workload, I doubt that the amount of work done by the Core i9 is
anywhere near that of the EPYC.
Of course, the other metrics also have their flaws, but at least for
cores used in servers they are not as far off as TDP. That even
includes Apple's Firestorm core: yes, it has a factor of about 1.5
higher IPC (i.e., work/frequency ratio) than Intel's and AMD's
offerings, but its advantage in power consumption is higher than 1.5.
>> Apparently they think that the 1MB L2 is big enough for their
>> customers, and they use the L3 only as a victim cache and for
>> communicating between cores. There seem to be enough usage patterns
>> that do not need that much cache; AMD has announced that they will
>> offer server CPUs (Bergamo) with more cores and less cache
>
>I was wondering if their decision was also linked to the
>microarchitecture of the cores themselves. E.g. they have too few
>in-flight instructions to withstand an L2-miss-L3-hit without stalling,
>and that in turn (somehow) makes a big LLC less beneficial (maybe not in
>terms of single-thread performance but in terms of overall consumption)?
I doubt it, for two reasons:
* If there are a significant number of L2 misses, serving them from L3
is much better than serving them from main memory, especially if the
core does not support many in-flight L2 misses. Conversely, for
Stream-like applications, and assuming that main memory has the same
bandwidth as L3, if you support enough in-flight accesses to cover
main-memory latency, you don't need an L3 (even if the working set
would fit in L3). But I doubt that these conditions are satisfied here.
* Many simultaneous in-flight instructions are important for Stream-like
stuff, and not at all for pointer chasing. My impression is that
typical cloud workloads (e.g., serving a web request) are more along
the lines of pointer chasing than of HPC.
I also don't think that code sharing is the reason for this L3 size:
with the current design, the shared code would be duplicated in many
L2 caches and would consume a lot of that capacity (especially if the
code has so little locality that the L3 plays a role). If I wanted to
exploit code sharing, I would keep separate I-caches all the way out
to the shared cache; e.g., have a per-core L2 I-cache just big enough
that the number of requests to the shared L3 cache does not cause
performance problems. That leaves the L2 D-caches for keeping
(typically unshared) data.
I still think that the CPU is designed for applications with few L2
misses.