AVX512 syntax

Tocar, Ilya

Nov 10, 2017, 3:34:13 PM

to golan...@googlegroups.com, Russ Cox, fr...@minio.io, Sharipov, Iskander

Hi,

We (Intel) are planning to work on AVX512 asm support for Go 1.11.

However, there are a few syntax issues that I'd like to discuss first:

1) AVX512 adds 16 new registers (xmm16-31) and extends the register width to 512 bits (zmm0-31).

It also adds mask registers k0-k7.

I think that adding X16-31,Y16-31,Z0-Z31 and K0-K7 registers is the most straightforward approach.

2)AVX512 adds masking support with optional zeroing.

Here is an example with Gnu syntax

vpslld $0xab, %zmm5, %zmm6{%k7} // masking

vpslld $0xab, %zmm5, %zmm6{%k7}{z} // masking with zeroing

So the question is: do we want special syntax?

One alternative is to copy the GNU approach, which would result in something like:

VPSLLD $171, Z5, Z6{K7}

VPSLLD $171, Z5, Z6{K7}{Z}

As far as I can see, no other Go architecture has something like this.

Another approach is to avoid special syntax altogether and just add a pseudo-register "Z":

VPSLLD $171, Z5, Z6, K7

VPSLLD $171, Z5, Z6, K7, Z

I don't really like the Z part, because it can be confused with the Z registers.

I guess we could use a .Z suffix for zeroing instead:

VPSLLD $171, Z5, Z6, K7

VPSLLD.Z $171, Z5, Z6, K7
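
To make the difference between plain masking and masking with zeroing concrete, here is a minimal Go sketch of the per-lane behaviour of a 16-lane dword shift. It only models the semantics and says nothing about the assembler syntax under discussion. The function name maskedShiftLeft and its parameters are made up for illustration.

```go
package main

import "fmt"

// maskedShiftLeft models a 16-lane dword left shift under an opmask.
// dst is the previous destination content, src the source lanes,
// mask has one bit per lane, and zeroing selects the {z} behaviour.
// (Hypothetical helper, for illustration only.)
func maskedShiftLeft(dst, src [16]uint32, shift uint, mask uint16, zeroing bool) [16]uint32 {
	out := dst
	for i := range src {
		switch {
		case mask&(1<<uint(i)) != 0:
			out[i] = src[i] << shift // lane selected by the mask: write the result
		case zeroing:
			out[i] = 0 // zeroing-masking: unselected lanes are cleared
		}
		// Otherwise (merging-masking) the old destination lane is kept.
	}
	return out
}

func main() {
	var dst, src [16]uint32
	for i := range src {
		dst[i] = 0xFFFFFFFF
		src[i] = uint32(i + 1)
	}
	merged := maskedShiftLeft(dst, src, 4, 0x0003, false)
	zeroed := maskedShiftLeft(dst, src, 4, 0x0003, true)
	fmt.Println(merged[:4]) // [16 32 4294967295 4294967295]
	fmt.Println(zeroed[:4]) // [16 32 0 0]
}
```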

3) AVX512 also adds an option to broadcast value from memory to fill the whole register:

vpslld $123, 508(%edx){1to16}, %zmm6 // Gnu syntax

This can be implemented the same way as in gas:

VPSLLD $123, 508(DX){1to16}, Z6

Or as a suffix:

VPSLLD.1to16 $123, 508(DX), Z6

Or as a separate operand:

VPSLLD $123, 508(DX), 1to16, Z6

4) Embedded rounding has a similar problem:

Should vaddpd {rn-sae}, %zmm4, %zmm5, %zmm6 be

VADDPD {rn-sae}, Z4, Z5, Z6

Or should rounding be a suffix or a separate argument?

Bear in mind that broadcasting and rounding cannot apply to the same instruction, but

masking and zeroing-masking can coexist with both.

5) Currently VCVTPD2PS and some other instructions exist in 2 forms, VCVTPD2PSX and VCVTPD2PSY,

because we can't distinguish between the 128-bit and 256-bit versions in some cases:

VCVTPD2PS (R11), X2 // Destination is 128 bit in both cases, so we need X/Y suffix

But if we add broadcasting we can distinguish those cases. 1to4 means 256, 1to2 means 128.

So should we add a version of VCVTPD2PS without X/Y suffix, but with mandatory broadcasting?

6) Unlike SSE and AVX, whose mnemonics differ by the leading V (VADDPD vs ADDPD), with AVX and AVX512 the same instruction can have 2 encodings:

VADDPD X1, X2, X3 // should we encode it as VEX or as EVEX?

Using VEX where possible may lead to a longer encoding, but avoids possible illegal instructions in cases like this:

// check AVX support

VADDPD 512(DX), X2, X3 // if encoded as EVEX, this will fail on CPUs with AVX but without AVX512

So it looks like we won’t be able to benefit from possible shorter encoding even if avx512 support is guaranteed.

Such cases should be rare, but I still wanted to mention this, in case there is something I’m overlooking.

7) The latest instructions from the AVX512_4FMAPS set (see https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf)

have a weirder syntax:

V4FMADDPS zmm1{k1}{z}, zmm2+3, m128 // from manual

Here zmm2+3 means that the instruction uses zmm2 and the next 3 registers (zmm3, zmm4, zmm5).

Gas ignores this and goes with the simpler

v4fmaddps (%ecx), %zmm4, %zmm1 // only the first register out of 4 is mentioned

So should Go support a gas-like version:

V4FMADDPS (CX), Z4, Z1

An Intel-manual-like syntax:

V4FMADDPS (CX), Z4+3, Z1

Or require explicit operands:

V4FMADDPS (CX), Z4, Z5, Z6, Z7, Z1 // will fail if 4,5,6,7 are not consecutive registers

Or go for something like ARM's STM?

V4FMADDPS (CX), [Z4,Z5,Z6,Z7], Z1

What are your thoughts?

Brad Fitzpatrick

Nov 10, 2017, 10:48:29 PM

to Tocar, Ilya, golan...@googlegroups.com, Russ Cox, fr...@minio.io, Sharipov, Iskander

Discussion here is fine, but if you don't get the answers you seek, you should file a bug. At least with GitHub issues, issues stay open until they're addressed. Email threads sometimes die out without anybody noticing they died out.

Actually you should probably open a bug at some point anyway.


Russ Cox

Nov 13, 2017, 10:29:18 AM

to Tocar, Ilya, golan...@googlegroups.com, fr...@minio.io, Sharipov, Iskander

On Fri, Nov 10, 2017 at 3:34 PM, Tocar, Ilya <ilya....@intel.com> wrote:

Hi,

We (Intel) are planning to work on AVX512 asm support for Go 1.11.

However, there are a few syntax issues that I'd like to discuss first:

1) AVX512 adds 16 new registers (xmm16-31) and extends the register width to 512 bits (zmm0-31).

It also adds mask registers k0-k7.

I think that adding X16-31,Y16-31,Z0-Z31 and K0-K7 registers is the most straightforward approach.

OK.

2)AVX512 adds masking support with optional zeroing.

Here is an example with Gnu syntax

vpslld $0xab, %zmm5, %zmm6{%k7} // masking

vpslld $0xab, %zmm5, %zmm6{%k7}{z} // masking with zeroing

So the question is: do we want special syntax?

One alternative is to copy the GNU approach, which would result in something like:

VPSLLD $171, Z5, Z6{K7}

VPSLLD $171, Z5, Z6{K7}{Z}

As far as I can see, no other Go architecture has something like this.

Yes, let's not introduce a whole new syntactic form here. We try to keep the syntax as regular as possible, meaning no special { } just for x86.

Another approach is to avoid special syntax altogether and just add a pseudo-register "Z":

VPSLLD $171, Z5, Z6, K7

VPSLLD $171, Z5, Z6, K7, Z

I don't really like the Z part, because it can be confused with the Z registers.

I guess we could use a .Z suffix for zeroing instead:

VPSLLD $171, Z5, Z6, K7

VPSLLD.Z $171, Z5, Z6, K7

VPSSLD.Z seems fine. I'm a bit more worried about the K7 on the end of the instruction. The destination is supposed to be the last argument, and here it's the next to last as I understand things. What is the Intel syntax?

3) AVX512 also adds an option to broadcast value from memory to fill the whole register:

vpslld $123, 508(%edx){1to16}, %zmm6 // Gnu syntax

This can be implemented the same way as in gas:

VPSLLD $123, 508(DX){1to16}, Z6

Or as a suffix:

VPSLLD.1to16 $123, 508(DX), Z6

Or as a separate operand:

VPSLLD $123, 508(DX), 1to16, Z6

Is "1to16" the only possible setting? What is the general form of this new option?

4) Embedded rounding has a similar problem:

Should vaddpd {rn-sae}, %zmm4, %zmm5, %zmm6 be

VADDPD {rn-sae}, Z4, Z5, Z6

Or should rounding be a suffix or a separate argument?

Bear in mind that broadcasting and rounding cannot apply to the same instruction, but

masking and zeroing-masking can coexist with both.

What's the general form here?

5) Currently VCVTPD2PS and some other instructions exist in 2 forms, VCVTPD2PSX and VCVTPD2PSY,

because we can't distinguish between the 128-bit and 256-bit versions in some cases:

VCVTPD2PS (R11), X2 // Destination is 128 bit in both cases, so we need X/Y suffix

But if we add broadcasting we can distinguish those cases. 1to4 means 256, 1to2 means 128.

So should we add a version of VCVTPD2PS without X/Y suffix, but with mandatory broadcasting?

Are you saying that VCVTPD2PS with 1to2 is exactly the same as the current VCVTPD2PSX?

6) Unlike SSE and AVX, whose mnemonics differ by the leading V (VADDPD vs ADDPD), with AVX and AVX512 the same instruction can have 2 encodings:

VADDPD X1, X2, X3 // should we encode it as VEX or as EVEX?

Using VEX where possible may lead to a longer encoding, but avoids possible illegal instructions in cases like this:

// check AVX support

VADDPD 512(DX), X2, X3 // if encoded as EVEX, this will fail on CPUs with AVX but without AVX512

So it looks like we won’t be able to benefit from possible shorter encoding even if avx512 support is guaranteed.

Such cases should be rare, but I still wanted to mention this, in case there is something I’m overlooking.

We definitely shouldn't break existing working AVX code on non-AVX512 systems, so yes, I would say that we should keep using the VEX encoding when that's a real instruction.

7) The latest instructions from the AVX512_4FMAPS set (see https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf)

have a weirder syntax:

V4FMADDPS zmm1{k1}{z}, zmm2+3, m128 // from manual

Here zmm2+3 means that the instruction uses zmm2 and the next 3 registers (zmm3, zmm4, zmm5).

Gas ignores this and goes with the simpler

v4fmaddps (%ecx), %zmm4, %zmm1 // only the first register out of 4 is mentioned

So should Go support a gas-like version:

V4FMADDPS (CX), Z4, Z1

An Intel-manual-like syntax:

V4FMADDPS (CX), Z4+3, Z1

Or require explicit operands:

V4FMADDPS (CX), Z4, Z5, Z6, Z7, Z1 // will fail if 4,5,6,7 are not consecutive registers

Or go for something like ARM's STM?

V4FMADDPS (CX), [Z4,Z5,Z6,Z7], Z1

Let's reuse the Go ARM syntax, since that avoids adding a new syntactic form for exactly the same concept. Note that you can shorten [Z4,Z5,Z6,Z7] to [Z4-Z7].

Thanks.

Russ

ilya....@intel.com

Nov 13, 2017, 2:19:07 PM

to golang-dev

VPSLLD $171, Z5, Z6, K7

VPSLLD.Z $171, Z5, Z6, K7

VPSSLD.Z seems fine. I'm a bit more worried about the K7 on the end of the instruction. The destination is supposed to be the last argument, and here it's the next to last as I understand things. What is the Intel syntax?

Masking/zeroing goes after the destination operand, and it looks like all implementations agree with each other:

VPSLLVW zmm1 {k1}{z}, zmm2, zmm3/m512 //from Intel manual

vpcompressd [rax] {k1}, zmm1 // Microsoft implementation from https://blogs.msdn.microsoft.com/vcblog/2017/07/11/microsoft-visual-studio-2017-supports-intel-avx-512/

vaddps zmm6{k7}{z}, zmm5, zmm4 // from gas tests

3) AVX512 also adds an option to broadcast value from memory to fill the whole register:

vpslld $123, 508(%edx){1to16}, %zmm6 // Gnu syntax

This can be implemented the same way as in gas:

VPSLLD $123, 508(DX){1to16}, Z6

Or as a suffix:

VPSLLD.1to16 $123, 508(DX), Z6

Or as a separate operand:

VPSLLD $123, 508(DX), 1to16, Z6

Is "1to16" the only possible setting? What is the general form of this new option?

One of {1tox}, where x can be 2,4,8,16

Interestingly MASM adds bcst keyword instead,

vcmpltps k1, zmm2, dword bcst [rbp] // instead of {1to16}
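
As a rough illustration of where the {1toN} values come from: N is the vector width divided by the element width, which is also why 1to2 and 1to4 can distinguish the 128-bit and 256-bit forms of an instruction with 64-bit elements. The sketch below assumes only 32- and 64-bit elements and 128/256/512-bit vectors. The helper name broadcastSuffix is hypothetical and not part of any assembler.

```go
package main

import "fmt"

// broadcastSuffix returns the "1toN" broadcast form for a vector of
// vectorBits bits whose elements are elemBits wide.
// (Hypothetical helper; only 32/64-bit elements are considered.)
func broadcastSuffix(vectorBits, elemBits int) (string, error) {
	if elemBits != 32 && elemBits != 64 {
		return "", fmt.Errorf("unsupported element width %d", elemBits)
	}
	switch vectorBits {
	case 128, 256, 512:
		return fmt.Sprintf("1to%d", vectorBits/elemBits), nil
	}
	return "", fmt.Errorf("unsupported vector width %d", vectorBits)
}

func main() {
	for _, c := range [][2]int{{512, 32}, {512, 64}, {256, 64}, {128, 64}} {
		s, err := broadcastSuffix(c[0], c[1])
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%d-bit vector, %d-bit elements: {%s}\n", c[0], c[1], s)
	}
}
```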

4) Embedded rounding has a similar problem:

Should vaddpd {rn-sae}, %zmm4, %zmm5, %zmm6 be

VADDPD {rn-sae}, Z4, Z5, Z6

Or should rounding be a suffix or a separate argument?

Bear in mind that broadcasting and rounding cannot apply to the same instruction, but

masking and zeroing-masking can coexist with both.

What's the general form here?

Either {rn-sae}, {ru-sae}, {rd-sae}, {rz-sae} or, for some instructions, just {sae}.

This case is even trickier, though!

Initially gas put the rounding operand after the last xmm register, possibly before a GPR:

vcvtsi2ss xmm6, xmm5, {rz-sae}, ebp // first released version of gas with AVX512

But later it also started accepting a version with rounding as the last argument:

vcvtsi2ss xmm6, xmm5, eax, {rn-sae} // same as Intel manual

I'm not sure what MASM does.

5) Currently VCVTPD2PS and some other instructions exist in 2 forms, VCVTPD2PSX and VCVTPD2PSY,

because we can't distinguish between the 128-bit and 256-bit versions in some cases:

VCVTPD2PS (R11), X2 // Destination is 128 bit in both cases, so we need X/Y suffix

But if we add broadcasting we can distinguish those cases. 1to4 means 256, 1to2 means 128.

So should we add a version of VCVTPD2PS without X/Y suffix, but with mandatory broadcasting?

Are you saying that VCVTPD2PS with 1to2 is exactly the same as the current VCVTPD2PSX?

No, I'm saying that VCVTPD2PS with 1to2 removes confusion between X and Y versions.

6) Unlike SSE and AVX, whose mnemonics differ by the leading V (VADDPD vs ADDPD), with AVX and AVX512 the same instruction can have 2 encodings:

VADDPD X1, X2, X3 // should we encode it as VEX or as EVEX?

Using VEX where possible may lead to a longer encoding, but avoids possible illegal instructions in cases like this:

// check AVX support

VADDPD 512(DX), X2, X3 // if encoded as EVEX, this will fail on CPUs with AVX but without AVX512

So it looks like we won’t be able to benefit from possible shorter encoding even if avx512 support is guaranteed.

Such cases should be rare, but I still wanted to mention this, in case there is something I’m overlooking.

We definitely shouldn't break existing working AVX code on non-AVX512 systems, so yes, I would say that we should keep using the VEX encoding when that's a real instruction.

Ok, just making sure.

7) The latest instructions from the AVX512_4FMAPS set (see https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf)

have a weirder syntax:

V4FMADDPS zmm1{k1}{z}, zmm2+3, m128 // from manual

Here zmm2+3 means that the instruction uses zmm2 and the next 3 registers (zmm3, zmm4, zmm5).

Gas ignores this and goes with the simpler

v4fmaddps (%ecx), %zmm4, %zmm1 // only the first register out of 4 is mentioned

So should Go support a gas-like version:

V4FMADDPS (CX), Z4, Z1

An Intel-manual-like syntax:

V4FMADDPS (CX), Z4+3, Z1

Or require explicit operands:

V4FMADDPS (CX), Z4, Z5, Z6, Z7, Z1 // will fail if 4,5,6,7 are not consecutive registers

Or go for something like ARM's STM?

V4FMADDPS (CX), [Z4,Z5,Z6,Z7], Z1

Let's reuse the Go ARM syntax, since that avoids adding a new syntactic form for exactly the same concept. Note that you can shorten [Z4,Z5,Z6,Z7] to [Z4-Z7].

[Z4-Z7] looks really nice!
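
For illustration only, here is a small Go sketch of how a range such as Z4-Z7 could be expanded into the consecutive register numbers it stands for. This is not the assembler's actual parser. The names parseZ and expandZRange are hypothetical, and the sketch assumes registers Z0 through Z31.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseZ turns a name like "Z4" into its register number (0-31).
// (Hypothetical helper, for illustration only.)
func parseZ(name string) (int, error) {
	if !strings.HasPrefix(name, "Z") {
		return 0, fmt.Errorf("%q is not a Z register", name)
	}
	n, err := strconv.ParseUint(name[1:], 10, 8)
	if err != nil || n > 31 {
		return 0, fmt.Errorf("%q is not a Z register", name)
	}
	return int(n), nil
}

// expandZRange expands "Z4-Z7" into [4 5 6 7]. The result is
// consecutive by construction, which is what the explicit
// four-operand form would otherwise have to check.
func expandZRange(s string) ([]int, error) {
	parts := strings.Split(s, "-")
	if len(parts) != 2 {
		return nil, fmt.Errorf("%q is not a register range", s)
	}
	lo, err := parseZ(parts[0])
	if err != nil {
		return nil, err
	}
	hi, err := parseZ(parts[1])
	if err != nil {
		return nil, err
	}
	if hi < lo {
		return nil, fmt.Errorf("range %q is reversed", s)
	}
	regs := make([]int, 0, hi-lo+1)
	for r := lo; r <= hi; r++ {
		regs = append(regs, r)
	}
	return regs, nil
}

func main() {
	fmt.Println(expandZRange("Z4-Z7")) // [4 5 6 7] <nil>
}
```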

Austin Clements

Nov 13, 2017, 4:31:31 PM

to Tocar, Ilya, golan...@googlegroups.com, Russ Cox, fr...@minio.io, Sharipov, Iskander

On Fri, Nov 10, 2017 at 3:34 PM, Tocar, Ilya <ilya....@intel.com> wrote:

Hi,

We (Intel) are planning to work on AVX512 asm support for Go 1.11.

However, there are a few syntax issues that I'd like to discuss first:

1) AVX512 adds 16 new registers (xmm16-31) and extends the register width to 512 bits (zmm0-31).

It also adds mask registers k0-k7.

I think that adding X16-31,Y16-31,Z0-Z31 and K0-K7 registers is the most straightforward approach.

For other situations where registers can be referred to with different widths, we've encoded the width in the instruction rather than the register naming. Is there a reason to separate out the X, Y, and Z registers here rather than putting the width in the instruction name? (It looks like perhaps we've already gone down this path with the Y registers, which may be reason enough.)

ilya....@intel.com

Nov 13, 2017, 4:55:34 PM

to golang-dev

Mainly because we already do this with X/Y, but also:

1) For better compatibility with other assemblers: porting something to Go syntax with s/zmm/Z is easier than changing each instruction.

2) For a better search experience:

Your search - vpternlogdz - did not match any documents.

Vs. a bunch of articles/docs for vpternlogd.

Frank Wessels

Nov 13, 2017, 7:22:40 PM

to Russ Cox, Tocar, Ilya, golan...@googlegroups.com, Sharipov, Iskander

I think it makes more sense to swap the last two arguments, thus:

VPSLLD.Z $171, Z5, K7, Z6

Then the last argument is still the destination and the mask (as penultimate argument) is used as a “filter/mask” for the destination register.

Note that the mask can be omitted (or implicitly replaced with K0 which is hardcoded for full access).

Since there are also instructions to set the mask registers (e.g. "kmovq k1, rax", plus kadd / kand / knot etc.), I think it is nice/consistent to treat them as “regular” registers.

NB: Intel syntax is, e.g.: vmovdqu64 zmm4{k1}{z}, [rdx]

3) AVX512 also adds an option to broadcast value from memory to fill the whole register:
vpslld $123, 508(%edx){1to16}, %zmm6 // Gnu syntax

This can be implemented the same way as in gas:

VPSLLD $123, 508(DX){1to16}, Z6

Or as a suffix:

VPSLLD.1to16 $123, 508(DX), Z6

Or as a separate operand:

VPSLLD $123, 508(DX), 1to16, Z6

Is "1to16" the only possible setting? What is the general form of this new option?

There are also {1to4} and {1to8} (for XMM and YMM respectively, I believe).

But the suffix approach would probably work best IMO.


6) Unlike SSE and AVX, whose mnemonics differ by the leading V (VADDPD vs ADDPD), with AVX and AVX512 the same instruction can have 2 encodings:
VADDPD X1, X2, X3 // should we encode it as VEX or as EVEX?
Using VEX where possible may lead to a longer encoding, but avoids possible illegal instructions in cases like this:
// check AVX support
VADDPD 512(DX), X2, X3 // if encoded as EVEX, this will fail on CPUs with AVX but without AVX512

So it looks like we won’t be able to benefit from possible shorter encoding even if avx512 support is guaranteed.
Such cases should be rare, but I still wanted to mention this, in case there is something I’m overlooking.

We definitely shouldn't break existing working AVX code on non-AVX512 systems, so yes, I would say that we should keep using the VEX encoding when that's a real instruction.

I guess quite a few projects use “dynamic” switching based on CPU capabilities, so if a subroutine has AVX512 (EVEX) instructions, it would be nice if the whole routine could take advantage of EVEX encoding.

Would adding a flag like “EVEX_ENCODING” in “textflag.h” make sense to enable this?

Chad Kunde

Nov 13, 2017, 8:24:31 PM

to golang-dev

On Monday, November 13, 2017 at 11:19:07 AM UTC-8, ilya....@intel.com wrote:

VPSLLD $171, Z5, Z6, K7

VPSLLD.Z $171, Z5, Z6, K7

VPSSLD.Z seems fine. I'm a bit more worried about the K7 on the end of the instruction. The destination is supposed to be the last argument, and here it's the next to last as I understand things. What is the Intel syntax?

Masking/zeroing goes after the destination operand, and it looks like all implementations agree with each other:
VPSLLVW zmm1 {k1}{z}, zmm2, zmm3/m512 //from Intel manual
vpcompressd [rax] {k1}, zmm1 // Microsoft implementation from https://blogs.msdn.microsoft.com/vcblog/2017/07/11/microsoft-visual-studio-2017-supports-intel-avx-512/
vaddps zmm6{k7}{z}, zmm5, zmm4 // from gas tests

Using the instruction suffix seems clear to me here, as it clearly signals a change to a register. The masking seems the more complex issue.

My first thought would be to use the addressing syntax:

VADDPS.Z Z6, Z5, Z4(K7)

But that leads to addressing and masking, which can be confusing:

VPCOMPRESSD.Z 512(AX)(DX*2)(K3), Z3

If we enforce the rule that masking must be last, I don't mind this format. Mixing the order would only lead to problems reading/parsing.

The suffix approach here would make the instruction more explicit as to what part of the register is being modified.

There's a knock-on effect here, where a second suffix can be used. Should we have an ordering requirement to ensure the instruction reads the same way each time?

If I were to decide, I'd require zeroing suffix last; it's operating on a different register than the broadcast and base portions:

VADDPD.RN.Z Z4, Z5, Z6(K2)

VPSLLD.1to16.Z $123, 508(DX), Z6(K4)
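
The ordering rule proposed here (at most one rounding or broadcast suffix, with the zeroing suffix last) could be checked along the following lines. This is only a sketch of the idea, not how the Go assembler parses opcodes. The type opcode and the function splitSuffixes are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// opcode is a hypothetical parsed form of "NAME[.MODIFIER][.Z]".
type opcode struct {
	Name     string
	Modifier string // rounding or broadcast suffix, if any
	Zeroing  bool   // true when the trailing .Z suffix is present
}

// splitSuffixes splits an opcode on "." and enforces two rules:
// the Z suffix must come last, and at most one other suffix is
// allowed (rounding and broadcast cannot be combined).
func splitSuffixes(s string) (opcode, error) {
	parts := strings.Split(s, ".")
	op := opcode{Name: parts[0]}
	suffixes := parts[1:]
	for i, suf := range suffixes {
		switch {
		case suf == "Z":
			if i != len(suffixes)-1 {
				return opcode{}, fmt.Errorf("%q: Z suffix must be last", s)
			}
			op.Zeroing = true
		case op.Modifier != "":
			return opcode{}, fmt.Errorf("%q: only one rounding/broadcast suffix allowed", s)
		default:
			op.Modifier = suf
		}
	}
	return op, nil
}

func main() {
	fmt.Println(splitSuffixes("VADDPD.RN.Z"))
	fmt.Println(splitSuffixes("VPSLLD.1to16.Z"))
	fmt.Println(splitSuffixes("VADDPD.Z.RN")) // rejected: Z is not last
}
```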


6) Unlike SSE and AVX, whose mnemonics differ by the leading V (VADDPD vs ADDPD), with AVX and AVX512 the same instruction can have 2 encodings:

VADDPD X1, X2, X3 // should we encode it as VEX or as EVEX?

Using VEX where possible may lead to a longer encoding, but avoids possible illegal instructions in cases like this:

// check AVX support

VADDPD 512(DX), X2, X3 // if encoded as EVEX, this will fail on CPUs with AVX but without AVX512

So it looks like we won’t be able to benefit from possible shorter encoding even if avx512 support is guaranteed.

Such cases should be rare, but I still wanted to mention this, in case there is something I’m overlooking.

We definitely shouldn't break existing working AVX code on non-AVX512 systems, so yes, I would say that we should keep using the VEX encoding when that's a real instruction.
Ok, just making sure.

We might be better served creating a way to set the encoding for the entire build, leaving the AVX512 checks to the developer. When mixing EVEX-only with VEX/EVEX instructions, the CPU-check is already needed. The encoding benefit should be available, too.

7) The latest instructions from the AVX512_4FMAPS set (see https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf)

have a weirder syntax:

V4FMADDPS zmm1{k1}{z}, zmm2+3, m128 // from manual

Here zmm2+3 means that the instruction uses zmm2 and the next 3 registers (zmm3, zmm4, zmm5).

Gas ignores this and goes with the simpler

v4fmaddps (%ecx), %zmm4, %zmm1 // only the first register out of 4 is mentioned

So should Go support a gas-like version:

V4FMADDPS (CX), Z4, Z1

An Intel-manual-like syntax:

V4FMADDPS (CX), Z4+3, Z1

Or require explicit operands:

V4FMADDPS (CX), Z4, Z5, Z6, Z7, Z1 // will fail if 4,5,6,7 are not consecutive registers

Or go for something like ARM's STM?

V4FMADDPS (CX), [Z4,Z5,Z6,Z7], Z1

Let's reuse the Go ARM syntax, since that avoids adding a new syntactic form for exactly the same concept. Note that you can shorten [Z4,Z5,Z6,Z7] to [Z4-Z7].

[Z4-Z7] looks really nice!

NEON-style syntax is short and clear. I think it's the best option here, too.

ilya....@intel.com

Nov 14, 2017, 3:02:41 PM

to golang-dev

On Monday, November 13, 2017 at 7:24:31 PM UTC-6, Chad Kunde wrote:

On Monday, November 13, 2017 at 11:19:07 AM UTC-8, ilya....@intel.com wrote:

VPSLLD $171, Z5, Z6, K7

VPSLLD.Z $171, Z5, Z6, K7

VPSSLD.Z seems fine. I'm a bit more worried about the K7 on the end of the instruction. The destination is supposed to be the last argument, and here it's the next to last as I understand things. What is the Intel syntax?

Masking/zeroing goes after the destination operand, and it looks like all implementations agree with each other:
VPSLLVW zmm1 {k1}{z}, zmm2, zmm3/m512 //from Intel manual
vpcompressd [rax] {k1}, zmm1 // Microsoft implementation from https://blogs.msdn.microsoft.com/vcblog/2017/07/11/microsoft-visual-studio-2017-supports-intel-avx-512/
vaddps zmm6{k7}{z}, zmm5, zmm4 // from gas tests

Using the instruction suffix seems clear to me here, as it clearly signals a change to a register. The masking seems the more complex issue.
My first thought would be to use the addressing syntax:
VADDPS.Z Z6, Z5, Z4(K7)

But that leads to addressing and masking, which can be confusing:
VPCOMPRESSD.Z 512(AX)(DX*2)(K3), Z3

If we enforce the rule that masking must be last, I don't mind this format. Mixing the order would only lead to problems reading/parsing.

Maybe we can use address syntax for zeroing:

VADDPS Z6, Z5, Z4, K7(Z)

This way we can only have one suffix for rounding or broadcast and avoid troubles with suffix ordering.

6) Unlike SSE and AVX, whose mnemonics differ by the leading V (VADDPD vs ADDPD), with AVX and AVX512 the same instruction can have 2 encodings:

VADDPD X1, X2, X3 // should we encode it as VEX or as EVEX?

Using VEX where possible may lead to a longer encoding, but avoids possible illegal instructions in cases like this:

// check AVX support

VADDPD 512(DX), X2, X3 // if encoded as EVEX, this will fail on CPUs with AVX but without AVX512

So it looks like we won’t be able to benefit from possible shorter encoding even if avx512 support is guaranteed.

Such cases should be rare, but I still wanted to mention this, in case there is something I’m overlooking.

We definitely shouldn't break existing working AVX code on non-AVX512 systems, so yes, I would say that we should keep using the VEX encoding when that's a real instruction.
Ok, just making sure.

We might be better served creating a way to set the encoding for the entire build, leaving the AVX512 checks to the developer. When mixing EVEX-only with VEX/EVEX instructions, the CPU-check is already needed. The encoding benefit should be available, too.

I'm not sure what you mean by "entire build". The whole function?

This will fail in the following case:

support_avx

VADDSD 512(BX), X2, X3 // avx512 not supported, must use vex

RET

support_avx512

VADDSD X1, X2, X3 // avx512 supported, can use evex, but vex is shorter

VADDSD 512(BX), X2, X3 // avx512 supported, can use evex and it is shorter

RET

Also, I'm worried about the complexity/benefit trade-off.

An EVEX-encoded instruction is only shorter if it has a memory operand whose offset is a multiple of the register length and is bigger than 128 but smaller than 4096, which should be rare enough.

There is no penalty for mixing AVX/AVX512, so we will at most save some space.

For reference, gas uses VEX where possible.
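
The size argument can be sketched in Go as follows. The sketch relies on two encoding facts: a VEX prefix is 2 or 3 bytes while an EVEX prefix is 4 bytes, and EVEX stores an 8-bit displacement scaled by a factor N (disp8*N), whereas VEX needs a 32-bit displacement for anything outside -128..127. The function names are hypothetical, N is passed in by the caller (16 is assumed here for a full 128-bit memory operand), and special cases such as a base register that always requires a displacement are ignored.

```go
package main

import "fmt"

// dispBytesVEX returns the displacement size under VEX:
// none for 0, one byte if it fits in int8, four bytes otherwise.
func dispBytesVEX(disp int32) int {
	switch {
	case disp == 0:
		return 0
	case disp >= -128 && disp <= 127:
		return 1
	}
	return 4
}

// dispBytesEVEX applies disp8*N compression: one byte is enough
// when disp is a multiple of n and disp/n fits in int8.
func dispBytesEVEX(disp, n int32) int {
	switch {
	case disp == 0:
		return 0
	case disp%n == 0 && disp/n >= -128 && disp/n <= 127:
		return 1
	}
	return 4
}

// evexSavings returns how many bytes EVEX saves over VEX, counting
// only prefix and displacement. Negative means EVEX is longer.
// vexPrefixLen is 2 or 3. (Hypothetical helper, for illustration.)
func evexSavings(disp, n int32, vexPrefixLen int) int {
	const evexPrefixLen = 4
	return (vexPrefixLen + dispBytesVEX(disp)) - (evexPrefixLen + dispBytesEVEX(disp, n))
}

func main() {
	fmt.Println(evexSavings(512, 16, 2)) // 1: disp32 shrinks to disp8
	fmt.Println(evexSavings(64, 16, 2))  // -2: both use disp8, EVEX prefix is longer
	fmt.Println(evexSavings(0, 16, 2))   // -2: no displacement at all
}
```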

iskander...@intel.com

Nov 17, 2017, 8:03:55 AM

to golang-dev

GitHub issue: #22779
