Stefan Monnier <mon...@iro.umontreal.ca> writes:
>>>I just bumped into the Ampere Altra's specifications and was struck: for
>>>a processor with 250W TDP isn't a 32MB last-level cache small?
>> That's an odd relation. When I read that, I put it in relation to the
>> number of cores: 32MB L3 for 80 cores on the Altra Q80-33, and 16MB for 128
>> cores on the Altra Max M128-30.
>
>I find TDP to be a better approximation of "total work done per second"
>than the number of cores or the CPU's frequency. I'm not claiming it's
>perfect, by a long shot, but I think it's meaningful.
I don't. E.g., consider that an 8-core Core i9-11900K can consume
more than 250W, similar to a 64-core EPYC. For a typical server
workload, I doubt that the amount of work done by the Core i9 is
anywhere near that of the EPYC.
Of course, the other metrics also have their flaws, but at least for
cores used in servers they are not as far off as TDP. That even
includes Apple's Firestorm core: yes, it has a factor of about 1.5
higher IPC (i.e., work/frequency ratio) than Intel's and AMD's
offerings, but its advantage in power consumption is higher than 1.5.
>> Apparently they think that the 1MB L2 is big enough for their
>> customers, and they use the L3 only as a victim cache and for
>> communicating between cores. There seem to be enough usage patterns
>> that do not need that much cache; AMD has announced that they will
>> offer server CPUs (Bergamo) with more cores and less cache
>
>I was wondering if their decision was also linked to the
>microarchitecture of the cores themselves. E.g. they have too few
>in-flight instructions to withstand an L2-miss-L3-hit without stalling,
>and that in turn (somehow) makes a big LLC less beneficial (maybe not in
>terms of single-thread performance but in terms of overall consumption)?
I doubt it, for two reasons:
* If there are a significant number of L2 misses, serving them from L3
is much better than serving them from main memory, especially if the
core does not support many in-flight L2 misses. Conversely, for
Stream-like applications, and assuming that main memory has the same
bandwidth as L3, if you support enough in-flight accesses to cover
main-memory latency, you don't need an L3 (even if the working set
would fit in L3). But I doubt that these conditions are satisfied here.
* Many simultaneous in-flight instructions are important for Stream-like
stuff, and not at all for pointer chasing. My impression is that
typical cloud workloads (e.g., serving a web request) are more along
the lines of pointer chasing than of HPC.
I also don't think that code sharing is the reason for this L3 size:
with the current design, the shared code would be duplicated in many
L2 caches and would consume a lot of that capacity (especially if the
code has so little locality that the L3 plays a role). If I wanted to
exploit code sharing, I would keep separate I-caches all the way out
to the shared cache; e.g., have a per-core L2 I-cache just big enough
that the number of requests to the shared L3 cache does not cause
performance problems. That leaves the L2 D-caches for keeping
(typically unshared) data.
I still think that the CPU is designed for applications with few L2
misses.