Hi,
We (Intel) are planning to work on AVX512 asm support for Go 1.11.
However, there are a few syntax issues that I'd like to discuss first:
1) AVX512 adds 16 new registers (xmm16-31) and extends the register width to 512 bits (zmm0-31).
It also adds mask registers k0-k7.
I think that adding X16-X31, Y16-Y31, Z0-Z31 and K0-K7 registers is the most straightforward approach.
2) AVX512 adds masking support with optional zeroing.
Here is an example in GNU syntax:
vpslld $0xab, %zmm5, %zmm6{%k7} // masking
vpslld $0xab, %zmm5, %zmm6{%k7}{z} // masking with zeroing
So the question is: do we want special syntax?
One alternative is to copy the GNU approach, which would result in something like:
VPSLLD $171, Z5, Z6{K7}
VPSLLD $171, Z5, Z6{K7}{Z}
As far as I can see, no other Go architecture has something like this.
Another approach is to avoid special syntax altogether and just add a pseudo-register "Z":
VPSLLD $171, Z5, Z6, K7
VPSLLD $171, Z5, Z6, K7, Z
I don't really like the Z part, because it can be confused with the Z registers.
I guess we could use a .Z suffix for zeroing:
VPSLLD $171, Z5, Z6, K7
VPSLLD.Z $171, Z5, Z6, K7
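For readers less familiar with the two masking modes, the difference can be sketched in Go. This is only a model of the semantics, not assembler code: the helper name maskedShiftLeft is made up for illustration, and the vector is cut down to 4 lanes to keep the example short (a real Z register holds 16 lanes of 32 bits).

```go
package main

import "fmt"

// maskedShiftLeft models a masked 32-bit lane shift on a 4-lane vector
// (hypothetical helper; a real ZMM register has 16 such lanes).
// Lanes whose mask bit is set receive src << shift. The other lanes keep
// their old dst value (merge masking) or are cleared when zeroing is set.
func maskedShiftLeft(dst, src [4]uint32, shift uint, mask uint8, zeroing bool) [4]uint32 {
	for i := range dst {
		switch {
		case mask&(1<<uint(i)) != 0:
			dst[i] = src[i] << shift
		case zeroing:
			dst[i] = 0
		}
	}
	return dst
}

func main() {
	dst := [4]uint32{9, 9, 9, 9}
	src := [4]uint32{1, 2, 3, 4}
	// Mask 0x5 selects lanes 0 and 2.
	fmt.Println(maskedShiftLeft(dst, src, 1, 0x5, false)) // merge masking
	fmt.Println(maskedShiftLeft(dst, src, 1, 0x5, true))  // zeroing masking
}
```

With merge masking the unselected lanes keep the value 9; with zeroing they become 0.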
3) AVX512 also adds an option to broadcast a value from memory to fill the whole register:
vpslld $123, 508(%edx){1to16}, %zmm6 // GNU syntax
This can be implemented the same way as in gas:
VPSLLD $123, 508(DX){1to16}, Z6
Or as a suffix:
VPSLLD.1to16 $123, 508(DX), Z6
Or as a separate operand:
VPSLLD $123, 508(DX), 1to16, Z6
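As a rough illustration of what {1to16} does, here is a small Go model. The helper names broadcast1to16 and shiftLeftAll are invented for this sketch, and the shift count 3 is chosen only to give a visible result.

```go
package main

import "fmt"

// broadcast1to16 models the {1to16} memory operand: a single 32-bit
// element is read and replicated into all 16 lanes of a 512-bit vector.
func broadcast1to16(elem uint32) [16]uint32 {
	var v [16]uint32
	for i := range v {
		v[i] = elem
	}
	return v
}

// shiftLeftAll models an unmasked per-lane 32-bit left shift.
func shiftLeftAll(v [16]uint32, imm uint) [16]uint32 {
	for i := range v {
		v[i] <<= imm
	}
	return v
}

func main() {
	mem := uint32(5) // stands in for the one element stored at 508(DX)
	fmt.Println(shiftLeftAll(broadcast1to16(mem), 3))
}
```

Without the broadcast, the same memory operand would supply 16 different elements (64 bytes) instead of one element repeated 16 times.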
4) Embedded rounding has a similar problem:
Should vaddpd {rn-sae}, %zmm4, %zmm5, %zmm6 be
VADDPD {rn-sae}, Z4, Z5, Z6
Or should rounding be a suffix or a separate argument?
Bear in mind that broadcasting and rounding cannot apply to the same instruction, but
masking and zeroing-masking can coexist with both.
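Whatever syntax is chosen, the combination rule above is easy to state as a check. A minimal sketch in Go, where the Modifiers type and its field names are purely hypothetical and not part of any existing assembler code:

```go
package main

import (
	"errors"
	"fmt"
)

// Modifiers is a hypothetical description of the optional AVX512
// features attached to one instruction.
type Modifiers struct {
	Mask      string // mask register such as "K7", empty if unmasked
	Zeroing   bool   // zeroing-masking instead of merge-masking
	Broadcast string // e.g. "1to16", empty if none
	Rounding  string // e.g. "rn-sae", empty if none
}

// Validate enforces the rule stated above: broadcasting and embedded
// rounding are mutually exclusive, while masking may accompany either.
func (m Modifiers) Validate() error {
	if m.Broadcast != "" && m.Rounding != "" {
		return errors.New("broadcast and embedded rounding cannot be combined")
	}
	return nil
}

func main() {
	ok := Modifiers{Mask: "K7", Zeroing: true, Broadcast: "1to16"}
	bad := Modifiers{Broadcast: "1to8", Rounding: "rn-sae"}
	fmt.Println(ok.Validate())
	fmt.Println(bad.Validate())
}
```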
5) Currently VCVTPD2PS and some other instructions exist in two forms, VCVTPD2PSX and VCVTPD2PSY,
because we can't distinguish between the 128-bit and 256-bit versions in some cases:
VCVTPD2PS (R11), X2 // Destination is 128 bit in both cases, so we need X/Y suffix
But if we add broadcasting we can distinguish those cases: 1to4 means 256, 1to2 means 128.
So should we add a version of VCVTPD2PS without the X/Y suffix, but with mandatory broadcasting?
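The arithmetic behind "1to4 means 256, 1to2 means 128" is just the broadcast count times the element size. A tiny Go sketch, with vectorBits as a made-up helper name:

```go
package main

import "fmt"

// vectorBits returns the source vector width implied by a {1toN}
// broadcast of elemBits-wide elements (hypothetical helper).
func vectorBits(n, elemBits int) int {
	return n * elemBits
}

func main() {
	// VCVTPD2PS reads 64-bit (double precision) source elements.
	fmt.Println(vectorBits(2, 64)) // 1to2: 128-bit source
	fmt.Println(vectorBits(4, 64)) // 1to4: 256-bit source
}
```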
6) Unlike SSE and AVX, which differ in the leading V (VADDPD vs ADDPD), with AVX and AVX512 the same instruction can have two encodings:
VADDPD X1, X2, X3 // should we encode it as VEX or as EVEX?
Using VEX where possible may lead to a longer encoding, but it avoids possible illegal instructions in cases like this:
// check AVX support
VADDPD 512(DX), X2, X3 // if encoded as EVEX, this will fail on CPUs with AVX but without AVX512
So it looks like we won’t be able to benefit from the possibly shorter encoding even if AVX512 support is guaranteed.
Such cases should be rare, but I still wanted to mention this, in case there is something I’m overlooking.
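To make the "longer encoding" point concrete: EVEX scales its 8-bit displacement by the memory operand size (disp8*N), so an offset of 512 can fit in one byte, while VEX needs a full 32-bit displacement. Below is a rough Go sketch of the byte counts. It assumes a 2-byte VEX prefix, a 4-byte EVEX prefix, no SIB byte, a base register other than BP/R13, and a full 128-bit memory operand (N = 16); dispSize and encodedLen are hypothetical helpers that cover only this simple case.

```go
package main

import "fmt"

// dispSize returns how many displacement bytes are needed for offset.
// scale is the disp8 scaling factor: 1 for VEX, and the memory operand
// size in bytes for a full-vector EVEX operand.
func dispSize(offset, scale int) int {
	if offset == 0 {
		return 0
	}
	if offset%scale == 0 && offset/scale >= -128 && offset/scale <= 127 {
		return 1
	}
	return 4
}

// encodedLen adds up prefix, opcode byte, ModRM byte and displacement.
func encodedLen(prefixLen, offset, scale int) int {
	return prefixLen + 1 + 1 + dispSize(offset, scale)
}

func main() {
	// VADDPD 512(DX), X2, X3
	fmt.Println("VEX bytes: ", encodedLen(2, 512, 1))
	fmt.Println("EVEX bytes:", encodedLen(4, 512, 16))
}
```

Under these assumptions the VEX form takes 8 bytes and the EVEX form 7, which is the one-byte saving that would be given up by always preferring VEX.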
7) The latest instructions, from the AVX512_4FMAPS set (see https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf),
have a weirder syntax:
V4FMADDPS zmm1{k1}{z}, zmm2+3, m128 // from manual
where zmm2+3 means that it uses zmm2 and the next 3 registers (zmm3, zmm4, zmm5).
Gas ignores this and goes with the simpler
v4fmaddps (%ecx), %zmm4, %zmm1 // only first register out of 4 is mentioned
So should Go support the gas-like version:
V4FMADDPS (CX), Z4, Z1
The Intel-manual-like syntax:
V4FMADDPS (CX), Z4+3, Z1
Or require explicit operands:
V4FMADDPS (CX), Z4, Z5, Z6, Z7, Z1 // will fail if 4,5,6,7 are not consecutive registers
Or go for something like ARM’s STM?
V4FMADDPS (CX), [Z4,Z5,Z6,Z7], Z1
What are your thoughts?
VPSLLD $171, Z5, Z6, K7
VPSLLD.Z $171, Z5, Z6, K7
VPSLLD.Z seems fine. I'm a bit more worried about the K7 on the end of the instruction. The destination is supposed to be the last argument, and here it's the next to last as I understand things. What is the Intel syntax?
3) AVX512 also adds an option to broadcast a value from memory to fill the whole register:
vpslld $123, 508(%edx){1to16}, %zmm6 // GNU syntax
This can be implemented the same way as in gas:
VPSLLD $123, 508(DX){1to16}, Z6
Or as a suffix:
VPSLLD.1to16 $123, 508(DX), Z6
Or as a separate operand:
VPSLLD $123, 508(DX), 1to16, Z6
Is "1to16" the only possible setting? What is the general form of this new option?
4) Embedded rounding has a similar problem:
Should vaddpd {rn-sae}, %zmm4, %zmm5, %zmm6 be
VADDPD {rn-sae}, Z4, Z5, Z6
Or should rounding be a suffix or a separate argument?
Bear in mind that broadcasting and rounding cannot apply to the same instruction, but
masking and zeroing-masking can coexist with both.
What's the general form here?
5) Currently VCVTPD2PS and some other instructions exist in two forms, VCVTPD2PSX and VCVTPD2PSY,
because we can't distinguish between the 128-bit and 256-bit versions in some cases:
VCVTPD2PS (R11), X2 // Destination is 128 bit in both cases, so we need X/Y suffix
But if we add broadcasting we can distinguish those cases: 1to4 means 256, 1to2 means 128.
So should we add a version of VCVTPD2PS without the X/Y suffix, but with mandatory broadcasting?
Are you saying that VCVTPD2PS with 1to2 is exactly the same as the current VCVTPD2PSX?
6) Unlike SSE and AVX, which differ in the leading V (VADDPD vs ADDPD), with AVX and AVX512 the same instruction can have two encodings:
VADDPD X1, X2, X3 // should we encode it as VEX or as EVEX?
Using VEX where possible may lead to a longer encoding, but it avoids possible illegal instructions in cases like this:
// check AVX support
VADDPD 512(DX), X2, X3 // if encoded as EVEX, this will fail on CPUs with AVX but without AVX512
So it looks like we won’t be able to benefit from the possibly shorter encoding even if AVX512 support is guaranteed.
Such cases should be rare, but I still wanted to mention this, in case there is something I’m overlooking.
We definitely shouldn't break existing working AVX code on non-AVX512 systems, so yes, I would say that we should keep using the VEX encoding when that's a real instruction.
7) The latest instructions, from the AVX512_4FMAPS set (see https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf),
have a weirder syntax:
V4FMADDPS zmm1{k1}{z}, zmm2+3, m128 // from manual
where zmm2+3 means that it uses zmm2 and the next 3 registers (zmm3, zmm4, zmm5).
Gas ignores this and goes with the simpler
v4fmaddps (%ecx), %zmm4, %zmm1 // only first register out of 4 is mentioned
So should Go support the gas-like version:
V4FMADDPS (CX), Z4, Z1
The Intel-manual-like syntax:
V4FMADDPS (CX), Z4+3, Z1
Or require explicit operands:
V4FMADDPS (CX), Z4, Z5, Z6, Z7, Z1 // will fail if 4,5,6,7 are not consecutive registers
Or go for something like ARM’s STM?
V4FMADDPS (CX), [Z4,Z5,Z6,Z7], Z1
Let's reuse the Go ARM syntax, since that avoids adding a new syntactic form for exactly the same concept. Note that you can shorten [Z4,Z5,Z6,Z7] to [Z4-Z7].
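To show that the list form and the range form carry the same information, here is a small Go sketch that expands either spelling into register numbers and rejects non-consecutive lists. It is not the assembler's actual parser; parseZ and expandRegList are hypothetical names used only for this illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseZ converts a name like "Z4" into its register number (0-31).
func parseZ(name string) (int, error) {
	name = strings.TrimSpace(name)
	if !strings.HasPrefix(name, "Z") {
		return 0, fmt.Errorf("not a Z register: %q", name)
	}
	n, err := strconv.Atoi(name[1:])
	if err != nil || n < 0 || n > 31 {
		return 0, fmt.Errorf("bad Z register: %q", name)
	}
	return n, nil
}

// expandRegList expands "[Z4-Z7]" or "[Z4,Z5,Z6,Z7]" into register
// numbers and rejects lists whose registers are not consecutive.
func expandRegList(s string) ([]int, error) {
	s = strings.TrimSpace(s)
	if !strings.HasPrefix(s, "[") || !strings.HasSuffix(s, "]") {
		return nil, fmt.Errorf("expected a [..] list, got %q", s)
	}
	body := s[1 : len(s)-1]
	var regs []int
	// Range form: [Zlo-Zhi].
	if parts := strings.SplitN(body, "-", 2); len(parts) == 2 {
		lo, err := parseZ(parts[0])
		if err != nil {
			return nil, err
		}
		hi, err := parseZ(parts[1])
		if err != nil {
			return nil, err
		}
		if hi < lo {
			return nil, fmt.Errorf("descending range in %q", s)
		}
		for r := lo; r <= hi; r++ {
			regs = append(regs, r)
		}
		return regs, nil
	}
	// Explicit form: [Za,Zb,...]; each register must follow the previous one.
	for _, name := range strings.Split(body, ",") {
		r, err := parseZ(name)
		if err != nil {
			return nil, err
		}
		if len(regs) > 0 && r != regs[len(regs)-1]+1 {
			return nil, fmt.Errorf("registers in %q are not consecutive", s)
		}
		regs = append(regs, r)
	}
	return regs, nil
}

func main() {
	fmt.Println(expandRegList("[Z4-Z7]"))
	fmt.Println(expandRegList("[Z4,Z5,Z6,Z7]"))
	fmt.Println(expandRegList("[Z4,Z6,Z7,Z8]"))
}
```

The first two calls yield the same four registers; the third is rejected because Z4 and Z6 are not adjacent.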
VPSLLD $171, Z5, Z6, K7
VPSLLD.Z $171, Z5, Z6, K7
VPSLLD.Z seems fine. I'm a bit more worried about the K7 on the end of the instruction. The destination is supposed to be the last argument, and here it's the next to last as I understand things. What is the Intel syntax?
Masking/zeroing goes after the destination operand, and it looks like all implementations agree with each other:
VPSLLVW zmm1 {k1}{z}, zmm2, zmm3/m512 // from the Intel manual
vpcompressd [rax] {k1}, zmm1 // Microsoft implementation, from https://blogs.msdn.microsoft.com/vcblog/2017/07/11/microsoft-visual-studio-2017-supports-intel-avx-512/
vaddps zmm6{k7}{z}, zmm5, zmm4 // from gas tests
5) Currently VCVTPD2PS and some other instructions exist in two forms, VCVTPD2PSX and VCVTPD2PSY,
because we can't distinguish between the 128-bit and 256-bit versions in some cases:
VCVTPD2PS (R11), X2 // Destination is 128 bit in both cases, so we need X/Y suffix
But if we add broadcasting we can distinguish those cases: 1to4 means 256, 1to2 means 128.
So should we add a version of VCVTPD2PS without the X/Y suffix, but with mandatory broadcasting?
Are you saying that VCVTPD2PS with 1to2 is exactly the same as the current VCVTPD2PSX?
No, I'm saying that VCVTPD2PS with 1to2 removes the confusion between the X and Y versions.
6) Unlike SSE and AVX, which differ in the leading V (VADDPD vs ADDPD), with AVX and AVX512 the same instruction can have two encodings:
VADDPD X1, X2, X3 // should we encode it as VEX or as EVEX?
Using VEX where possible may lead to a longer encoding, but it avoids possible illegal instructions in cases like this:
// check AVX support
VADDPD 512(DX), X2, X3 // if encoded as EVEX, this will fail on CPUs with AVX but without AVX512
So it looks like we won’t be able to benefit from the possibly shorter encoding even if AVX512 support is guaranteed.
Such cases should be rare, but I still wanted to mention this, in case there is something I’m overlooking.
We definitely shouldn't break existing working AVX code on non-AVX512 systems, so yes, I would say that we should keep using the VEX encoding when that's a real instruction.
Ok, just making sure.
7) The latest instructions, from the AVX512_4FMAPS set (see https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf),
have a weirder syntax:
V4FMADDPS zmm1{k1}{z}, zmm2+3, m128 // from manual
where zmm2+3 means that it uses zmm2 and the next 3 registers (zmm3, zmm4, zmm5).
Gas ignores this and goes with the simpler
v4fmaddps (%ecx), %zmm4, %zmm1 // only first register out of 4 is mentioned
So should Go support the gas-like version:
V4FMADDPS (CX), Z4, Z1
The Intel-manual-like syntax:
V4FMADDPS (CX), Z4+3, Z1
Or require explicit operands:
V4FMADDPS (CX), Z4, Z5, Z6, Z7, Z1 // will fail if 4,5,6,7 are not consecutive registers
Or go for something like ARM’s STM?
V4FMADDPS (CX), [Z4,Z5,Z6,Z7], Z1
Let's reuse the Go ARM syntax, since that avoids adding a new syntactic form for exactly the same concept. Note that you can shorten [Z4,Z5,Z6,Z7] to [Z4-Z7].
[Z4-Z7] looks really nice!
On Monday, November 13, 2017 at 11:19:07 AM UTC-8, ilya....@intel.com wrote:
VPSLLD $171, Z5, Z6, K7
VPSLLD.Z $171, Z5, Z6, K7
VPSLLD.Z seems fine. I'm a bit more worried about the K7 on the end of the instruction. The destination is supposed to be the last argument, and here it's the next to last as I understand things. What is the Intel syntax?
Masking/zeroing goes after the destination operand, and it looks like all implementations agree with each other:
VPSLLVW zmm1 {k1}{z}, zmm2, zmm3/m512 // from the Intel manual
vpcompressd [rax] {k1}, zmm1 // Microsoft implementation, from https://blogs.msdn.microsoft.com/vcblog/2017/07/11/microsoft-visual-studio-2017-supports-intel-avx-512/
vaddps zmm6{k7}{z}, zmm5, zmm4 // from gas tests
Using the instruction suffix seems clear to me here, as it clearly signals a change to a register. The masking seems the more complex issue.
My first thought would be to use the addressing syntax:
VADDPS.Z Z6, Z5, Z4(K7)
But that leads to addressing and masking on the same operand, which can be confusing:
VPCOMPRESSD.Z 512(AX)(DX*2)(K3), Z3
If we enforce the rule that masking must be last, I don't mind this format. Mixing the order would only lead to problems reading/parsing.
6) Unlike SSE and AVX, which differ in the leading V (VADDPD vs ADDPD), with AVX and AVX512 the same instruction can have two encodings:
VADDPD X1, X2, X3 // should we encode it as VEX or as EVEX?
Using VEX where possible may lead to a longer encoding, but it avoids possible illegal instructions in cases like this:
// check AVX support
VADDPD 512(DX), X2, X3 // if encoded as EVEX, this will fail on CPUs with AVX but without AVX512
So it looks like we won’t be able to benefit from the possibly shorter encoding even if AVX512 support is guaranteed.
Such cases should be rare, but I still wanted to mention this, in case there is something I’m overlooking.
We definitely shouldn't break existing working AVX code on non-AVX512 systems, so yes, I would say that we should keep using the VEX encoding when that's a real instruction.
Ok, just making sure.
We might be better served creating a way to set the encoding for the entire build, leaving the AVX512 checks to the developer. When mixing EVEX-only with VEX/EVEX instructions, the CPU check is already needed. The encoding benefit should be available, too.