Teardown of Zen 4 AVX-512

Marcus

Sep 27, 2022, 1:31:56 AM

Here is a detailed teardown of the Zen 4 AVX-512 implementation:
https://www.mersenneforum.org/showthread.php?p=614191

I especially note that AMD's "double-pumped" processing of 512-bit
vectors reduces the front-end load compared to using 256-bit vectors (on
the same machine), which enables the CPU to clock *higher* during
512-bit AVX-512 loads than 256-bit AVX2 loads.
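
A rough way to see the front-end saving is to count instructions versus
execution passes. The sketch below is only an illustration under a
simplifying assumption (one streaming operation over a buffer, with each
512-bit instruction split into two passes over a 256-bit datapath); the
function name and the buffer size are made up for the example, and it
does not model the real Zen 4 pipeline.

```python
from math import ceil

def vector_op_counts(n_bytes, vector_bits, datapath_bits=256):
    """Return (front_end_instructions, execution_passes) for one
    streaming operation over n_bytes of data.

    Assumption: an instruction wider than the datapath is executed as
    ceil(vector_bits / datapath_bits) passes ("pumps").
    """
    vector_bytes = vector_bits // 8
    instructions = ceil(n_bytes / vector_bytes)
    passes_per_instruction = ceil(vector_bits / datapath_bits)
    return instructions, instructions * passes_per_instruction

# Hypothetical 4 KiB buffer processed with 256-bit and 512-bit vectors.
print("256-bit:", vector_op_counts(4096, 256))
print("512-bit:", vector_op_counts(4096, 512))
```

In this model the execution work is the same for both widths, but the
512-bit version needs half as many instructions to be fetched, decoded
and issued.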

Thus, in the age of dark silicon it appears that having "super
instructions" is the right way to go (e.g. VVM VEC/LOOP, Libre-SOC SV
and MRISC32 vector operations).

/Marcus

Thomas Koenig

Oct 1, 2022, 12:14:12 PM

Marcus <m.de...@this.bitsnbites.eu> schrieb:

> Here is a detailed teardown of the Zen 4 AVX-512 implementation:
> https://www.mersenneforum.org/showthread.php?p=614191
>
> I especially note that AMD's "double-pumped" processing of 512-bit
> vectors reduces the front-end load compared to using 256-bit vectors (on
> the same machine), which enables the CPU to clock *higher* during
> 512-bit AVX-512 loads than 256-bit AVX2 loads.

Yes, interesting. Seems like a good compromise - if you (or the
compiler) managed to package the load into 512-bit sized chunks,
then delivering two of the results in two consecutive cycles seems
like a good compromise.

> Thus, in the age of dark silicon

What is the age of dark silicon? It has a nice ring to it, but I'm
not quite sure what you mean.

> it appears that having "super
> instructions" is the right way to go (e.g. VVM VEC/LOOP, Libre-SOC SV
> and MRISC32 vector operations).

Issuing instructions is significant work, too.

MitchAlsup

Oct 1, 2022, 12:19:23 PM

On Saturday, October 1, 2022 at 11:14:12 AM UTC-5, Thomas Koenig wrote:
> Marcus <m.de...@this.bitsnbites.eu> schrieb:
> > Here is a detailed teardown of the Zen 4 AVX-512 implementation:
> > https://www.mersenneforum.org/showthread.php?p=614191
> >
> > I especially note that AMD's "double-pumped" processing of 512-bit
> > vectors reduces the front-end load compared to using 256-bit vectors (on
> > the same machine), which enables the CPU to clock *higher* during
> > 512-bit AVX-512 loads than 256-bit AVX2 loads.
> Yes, interesting. Seems like a good compromise - if you (or the
> compiler) managed to package the load into 512-bit sized chunks,
> then delivering two of the results in two consecutive cycles seems
> like a good compromise.
> > Thus, in the age of dark silicon
> What is the age of dark silicon? It has a nice ring to it, but I'm
> not quite sure what you mean.

Dark silicon is silicon area that is present but has to be powered down
because if you simply leave it powered up all the time it consumes too
much energy, blowing your power budget/rating.

Simon Sabato

Oct 12, 2022, 4:53:49 AM

On Saturday, October 1, 2022 at 9:19:23 AM UTC-7, MitchAlsup wrote:
> On Saturday, October 1, 2022 at 11:14:12 AM UTC-5, Thomas Koenig wrote:
> > Marcus <m.de...@this.bitsnbites.eu> schrieb:

> > What is the age of dark silicon? It has a nice ring to it, but I'm
> > not quite sure what you mean.
> Dark silicon is silicon area that is present but has to be powered down
> because if you simply leave it powered up all the time it consumes too
> much energy, blowing your power budget/rating.

Not saying that's wrong, but I can speak to a slightly different usage in my experience (maybe a more optimistic view? :) )

In the old days, gates were the limiting factor, so we designed a smaller number of circuits that were highly flexible. Today, in most cases, power is the limiting factor, so we increasingly find ourselves adding fixed-function units, only "lighting them up" when needed. Fixed logic is more efficient than general-purpose logic, i.e. an ASIC beats a CPU.

A classic example would be video encoder/decoder in every GPU now. But honestly, if you step back, you can argue that we've been on this road a long time (old-school 3D accelerators, AES-NI, even FPUs are all about adding more efficient single-purpose HW that is only used when a particular function is needed).

I only started hearing "dark silicon" ~8 years back, when it became clear that we probably should, for example, add 30% die area for a feature that is rarely used. Over the lifetime of the chip, the silicon is, say, 20% of total cost of ownership, with power being most of the rest. That 30% die-area feature is what we'd call "dark silicon". When you consider that fixed-function HW is often 10-100X more efficient than a SW implementation, you can imagine that it makes sense to include such HW even if it only runs, say, 10-20% of your overall workload. 20% * 30% means the function added ~6% to your cost, and it's going to take 10-20% of your work and do it 10-100X more efficiently.
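
To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. It rests on assumptions that go beyond what I said above: power is treated as the entire remaining 80% of cost, and power cost is taken to scale linearly with the share of work done in software. The function name and parameters are made up for the illustration.

```python
def relative_tco(work_fraction, efficiency_gain,
                 silicon_share=0.20, extra_die_area=0.30):
    """Lifetime cost relative to a chip without the fixed-function block.

    Assumptions (illustrative only): everything that is not silicon cost
    is power cost, and the offloaded work_fraction of the workload uses
    1/efficiency_gain of the energy it would need in software.
    """
    power_share = 1.0 - silicon_share
    # 20% * 30% extra die area adds 6% of the baseline cost.
    silicon_cost = silicon_share * (1.0 + extra_die_area)
    power_cost = power_share * ((1.0 - work_fraction)
                                + work_fraction / efficiency_gain)
    return silicon_cost + power_cost

# Sweep the ranges from the post: 10-20% of the work, 10-100X efficiency.
for fraction in (0.10, 0.20):
    for gain in (10, 100):
        print(f"work={fraction:.0%} gain={gain}X -> "
              f"relative cost {relative_tco(fraction, gain):.3f}")
```

Under these assumptions every combination in that range comes out below 1.0, so the extra die area pays for itself, though only barely at the low end (10% of the work at 10X).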

In other words, what Mitch said IS TRUE: you can't keep all this logic powered on at the same time, so you have to turn some of it off. I'm just pointing out that it's often "by design", not "a shame". We could always turn the whole chip on at once, at lower clocks; it just doesn't work as well that way...

(There's also the fact that, if you're willing to separately test that 30% feature, you can sell a version of your chip without it, and get a decent number of partially functional dies "for free" instead of just throwing them out)

(There's also maybe something to be said about chiplets here, and how they enable late-binding decisions on which "30% functions" to drop into the package)

BTW sorry to bump a semi-necro thread, I just rediscovered comp.arch today, and I'm tickled pink that a forum exists where people are talking good nitty gritty computer stuff. I'm a long term chip guy myself, so pleased to meet and respect to you all !!!

-Simon Sabato

Quadibloc

Oct 12, 2022, 9:55:53 AM

On Saturday, October 1, 2022 at 10:14:12 AM UTC-6, Thomas Koenig wrote:

> What is the age of dark silicon? It has a nice ring to it, but I'm
> not quite sure what you mean.

I was tempted to start a reply with "When the Moon is in the
Seventh House, and Jupiter aligns with Mars..." but the issue
has been properly covered in subsequent posts:

Now that Moore's Law has enabled the construction of microprocessor
chips with as many as four! CPUs on a single die - or even more - and
these CPUs having *megabytes* of cache, and being very large CPUs to
begin with, as they incorporate large amounts of extra circuitry for
elaborate out-of-order execution features... it cannot be wondered at
that it is considered worthwhile to use some of that very large transistor
budget to include on the same die extra circuitry to accelerate specialized
functions - which, if turned off when not in use, will add less power consumption
and heat (through leakage - CMOS isn't supposed to be using power when it
isn't doing anything, but today's super-tiny transistors are less and less well
approximated by the simple imaginary ideal transistors used in the first stage
of design) to that of the rest of the chip.

When Moore's Law gives you a billion transistors, and things like compression
formats and encryption algorithms get standardized...

John Savard

Brett

Oct 12, 2022, 5:25:12 PM

Around a third of an Apple die is dark silicon: things like the neural
engine and the rest that is not labeled. Intel and AMD are not far behind.
Lots of encode/decode engines, mostly for the cameras, I would guess.

https://semianalysis.substack.com/p/apple-m2-die-shot-and-architecture

Scott Lurndal

Oct 12, 2022, 7:48:12 PM

The unlabelled portions of the M2 die from that site include:

- Media Engine encoder/decoder
- Video Engine
- Secure Enclave (microcontrollers, TPM, etc.)
- Image Signal Processor
- USB and Thunderbolt controllers
- I3C (mainboard bus)
- SPI (flash)
- power management subsystem