Melzzzz,
Executing VZEROALL/VZEROUPPER at the *end* of a section of AVX code allows an x86 core to free up physical (rename) registers to speed up execution of subsequent SSE code ... though I didn't think you needed to execute them *before* AVX usage.
For instance, let's say a microarchitecture maps a 256-bit AVX architectural/logical register (YMM0-YMM15) to 2 x 128-bit physical registers, i.e. the logical register is twice the width of the physical register. During execution of AVX-256 code that touches all 16 YMM regs, a maximum of 16*2=32 physical registers are consumed from the physical register file to map the 16 YMM logical registers. But after execution of VZEROUPPER, the 16 'upper' physical registers can be freed and returned to the rename pool of free registers; after VZEROALL, all 32 can be freed and returned. This can be done with a little additional state per logical register, which we can refer to as zero bits (z-bits), maintained in the speculative register map. A z-bit indicates that the upper 128b (or lower 128b) of a register is known to be zero, so no physical register needs to back that half. Freeing up physical regs can allow a subsequent section of SSE code to execute faster, so it's good practice to insert a VZEROALL/VZEROUPPER at the end of an AVX subroutine as a hint to the hardware to free up resources, if in fact software doesn't intend to execute AVX code for some time.
With regard to mixing SSE and AVX code, I would expect a penalty on both Intel and AMD CPUs when transitioning between SSE <-> AVX-256 code. Are you also seeing a penalty when transitioning between SSE <-> AVX-128 code? If so, on which CPUs, and can you give a rough estimate of its size?