SIMD support status in Golang

Zhongwei Yao

Dec 31, 2015, 1:04:47 AM
to golang-dev
Hi, all,
What is the status of SIMD support in Golang (especially in Golang's assembler and linker) for AMD64?

I've checked, and ARM SIMD does not work on current master. If SIMD is not supported on all architectures, is there any plan to add it?

--
Best regards,
Zhongwei

Keith Randall

Dec 31, 2015, 8:11:12 PM
to Zhongwei Yao, golang-dev
I don't know specifically about ARM, but the assembler does support SIMD instructions in general.  We use a bunch of them for AMD64.  We don't support every instruction, but most of the common ones are in there.  We can add more if you need them.

You can always use BYTE directives to encode any instruction you want.
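
For illustration, here is a minimal Go sketch (not from the thread) of how such BYTE sequences can be generated instead of written by hand. It assumes a register-to-register SSE form with the destination in the ModRM reg field and the source in the r/m field, and it only covers X0..X7, since higher registers would also need a REX prefix. The helper names `byteDirectives`, `regRegModRM`, and `sseRegReg` are hypothetical. The PMULLD opcode bytes (66 0F 38 40 /r) should be checked against the Intel manual before use.

```go
package main

import (
	"fmt"
	"strings"
)

// byteDirectives renders raw machine-code bytes as Go assembler BYTE
// directives separated by semicolons.
func byteDirectives(code []byte) string {
	parts := make([]string, len(code))
	for i, b := range code {
		parts[i] = fmt.Sprintf("BYTE $0x%02X", b)
	}
	return strings.Join(parts, "; ")
}

// regRegModRM builds the ModRM byte for a register-to-register form
// (mod = 11): dst goes in the reg field, src in the r/m field.
// Only X0..X7 are accepted; X8..X15 would also require a REX prefix.
func regRegModRM(src, dst int) (byte, error) {
	if src < 0 || src > 7 || dst < 0 || dst > 7 {
		return 0, fmt.Errorf("register out of range: src=X%d dst=X%d", src, dst)
	}
	return byte(0xC0 | dst<<3 | src), nil
}

// sseRegReg (hypothetical helper) returns a #define line that encodes
// "NAME Xsrc, Xdst" in Go assembler operand order.
func sseRegReg(name string, opcode []byte, src, dst int) (string, error) {
	modrm, err := regRegModRM(src, dst)
	if err != nil {
		return "", err
	}
	code := append(append([]byte{}, opcode...), modrm)
	return fmt.Sprintf("#define %s_X%d_X%d %s", name, src, dst, byteDirectives(code)), nil
}

func main() {
	// Assumed encoding: PMULLD xmm1, xmm2/m128 is 66 0F 38 40 /r.
	line, err := sseRegReg("PMULLD", []byte{0x66, 0x0F, 0x38, 0x40}, 1, 0)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(line)
}
```

The generated line can be pasted into a .s file. Generating the macros keeps the opcode bytes in one place and makes it cheap to try a different register pair.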


David Chase

Jan 1, 2016, 12:51:20 PM
to Keith Randall, Zhongwei Yao, golang-dev
If there are particular missing instructions that would be helpful to you, that's useful information, too.

Zhongwei Yao

Jan 1, 2016, 7:50:51 PM
to Keith Randall, golang-dev

Thanks, I'll check it.

Erwin Driessens

Jan 3, 2016, 8:25:07 PM
to golang-dev, k...@google.com, zhongw...@linaro.org

> If there are particular missing instructions that would be helpful to you, that's useful information, too.

I've had to use BYTE directives to get HADDPS and HADDPD (SSE3), and soon I will need some SSE4.1 and SSE4.2 instructions that I doubt are supported yet. Please add the full set of SSE instructions.

Zhongwei Yao

Jan 4, 2016, 4:38:12 AM
to David Chase, Keith Randall, golang-dev
Hi, David,
On the ARM64 Golang platform, ARM SIMD (NEON) instructions do not seem to be supported at all. And there is no "BYTE" instruction support on ARM64.

I tried an instruction like "MUL V0, V1, V2", and it fails with "illegal or missing addressing mode for symbol V0".

I have no particular instruction to add currently, but I feel many libraries could be sped up by SIMD.
--
Best regards,
Zhongwei

Anthony Martin

Jan 4, 2016, 5:55:32 AM
to Zhongwei Yao, David Chase, Keith Randall, golang-dev
Zhongwei Yao <zhongw...@linaro.org> once said:
> And there is no "BYTE" instruction support on ARM64.

Use the "WORD" instruction on ARM64.

Anthony

Dave Cheney

Jan 4, 2016, 8:02:14 AM
to Anthony Martin, Zhongwei Yao, David Chase, Keith Randall, golang-dev

Instructions are added on an as-needed basis. Please document the precise instructions you need.


minux

Jan 4, 2016, 2:14:55 PM
to Zhongwei Yao, Keith Randall, golang-dev, David Chase


On Jan 4, 2016 04:38, "Zhongwei Yao" <zhongw...@linaro.org> wrote:
> On the ARM64 Golang platform, ARM SIMD (NEON) instructions do not seem to be supported at all. And there is no "BYTE" instruction support on ARM64.

On RISC architectures, use WORD $inst for unsupported instructions.

Erwin Driessens

Jan 5, 2016, 6:20:41 AM
to golang-dev, k...@google.com, zhongw...@linaro.org
Yesterday I needed the following unsupported SSE instructions to speed up a hotspot:
LDDQU
PSLLDQ
PSRLDQ
PMOVZXBD
PMULLD
PHADDD
PADDD
CVTDQ2PS
I was able to get them to work using the BYTE directive, but it's so cumbersome, and the resulting code isn't as clear as it could be. Also, experimentation is much hindered by the 'cast in stone' defines. Maybe I am doing things wrong? Here's an example of my code:

// LDOU aka LDDQU - 11110010:00001111:11110000: modA xmmreg r/m
#define LDOU_AX_X0 BYTE $0xF2; BYTE $0x0F; BYTE $0xF0; BYTE $0x00;
#define LDOU_BX_X1 BYTE $0xF2; BYTE $0x0F; BYTE $0xF0; BYTE $0x0B;
// PSLLO aka PSLLDQ - 01100110:00001111:01110011:11111 xmmreg: imm8
#define PSLLO_X0_7 BYTE $0x66; BYTE $0x0F; BYTE $0x73; BYTE $0xF8; BYTE $0x07;
#define PSLLO_X1_7 BYTE $0x66; BYTE $0x0F; BYTE $0x73; BYTE $0xF9; BYTE $0x07;
// PSRLO aka PSRLDQ - 01100110:00001111:01110011:11011 xmmreg: imm8
#define PSRLO_X3_7 BYTE $0x66; BYTE $0x0F; BYTE $0x73; BYTE $0xDB; BYTE $0x07;
#define PSRLO_X3_4 BYTE $0x66; BYTE $0x0F; BYTE $0x73; BYTE $0xDB; BYTE $0x04;
// PMOVZXBD - 01100110:00001111:00111000:00110001:11 xmmreg1 xmmreg2 
#define PMOVZXBD_X3_X0 BYTE $0x66; BYTE $0x0F; BYTE $0x38; BYTE $0x31; BYTE $0xC3;
#define PMOVZXBD_X3_X1 BYTE $0x66; BYTE $0x0F; BYTE $0x38; BYTE $0x31; BYTE $0xCB;
#define PMOVZXBD_X3_X2 BYTE $0x66; BYTE $0x0F; BYTE $0x38; BYTE $0x31; BYTE $0xD3;
// PMULLD - 01100110:00001111:00111000:01000000:11 xmmreg1 xmmreg2
#define PMULLD_X0_X0 BYTE $0x66; BYTE $0x0F; BYTE $0x38; BYTE $0x40; BYTE $0xC0
#define PMULLD_X1_X1 BYTE $0x66; BYTE $0x0F; BYTE $0x38; BYTE $0x40; BYTE $0xC9
#define PMULLD_X2_X2 BYTE $0x66; BYTE $0x0F; BYTE $0x38; BYTE $0x40; BYTE $0xD2
// PHADDD - 01100110:00001111:00111000:00000010:11 xmmreg1 xmmreg2
#define PHADDD_X0_X0 BYTE $0x66; BYTE $0x0F; BYTE $0x38; BYTE $0x02; BYTE $0xC0
#define PHADDD_X1_X1 BYTE $0x66; BYTE $0x0F; BYTE $0x38; BYTE $0x02; BYTE $0xC9
// PADDD - 66 0F FE /r
#define PADDD_X1_X0 BYTE $0x66; BYTE $0x0F; BYTE $0xFE; BYTE $0xC1
#define PADDD_X2_X0 BYTE $0x66; BYTE $0x0F; BYTE $0xFE; BYTE $0xC2
// CVTDQ2PS - 00001111:01011011:11 xmmreg1 xmmreg2
#define CVTDQ2PS_X0_X0 BYTE $0x0F; BYTE $0x5B; BYTE $0xC0;

//// Env3x3Gray_distance ///////////////////////////////////////////////////////

// func Env3x3Gray_distance(env0, env1 *Env3x3Gray) float32
TEXT ·Env3x3Gray_distance(SB),NOSPLIT,$0-24
	MOVQ env0+0(FP), AX // get env0 pointer
	MOVQ env1+8(FP), BX // get env1 pointer
	// load 16 bytes of data per env (7 too many)
	LDOU_AX_X0 /* MOVOU (AX), X0 */
	LDOU_BX_X1 /* MOVOU (BX), X1 */
	// keep bytes [0..8], clear bytes [9..15], by left shifting
	PSLLO_X0_7 // shift left 7 bytes
	PSLLO_X1_7 // shift left 7 bytes
	// get pairwise maxima and minima
	MOVOA X1, X2
	PMINUB X0, X2 // minima in X2
	MOVOA X1, X3
	PMAXUB X0, X3 // maxima in X3
	// compute deltas
	PSUBB X2, X3 // deltas (max - min) in X3
	PSRLO_X3_7 // shift right to align
	// widen deltas (uint8 -> int32), 4 elements per operation
	PMOVZXBD_X3_X0
	PSRLO_X3_4 // shift right 4 bytes
	PMOVZXBD_X3_X1
	PSRLO_X3_4 // shift right 4 bytes
	PMOVZXBD_X3_X2
	// square deltas
	PMULLD_X0_X0
	PMULLD_X1_X1
	PMULLD_X2_X2
	// add deltas (X2 has just one element, needs no horizontal add)
	PHADDD_X0_X0
	PHADDD_X0_X0
	PHADDD_X1_X1
	PHADDD_X1_X1
	PADDD_X1_X0
	PADDD_X2_X0
	// final floating point processing
	CVTDQ2PS_X0_X0 // int32 -> float32
	SQRTSS X0, X0 // square root
	// return normalized distance
	MOVSS $0.001307189542484, X1 // factor: 1.0/sqrt(9 * 255*255)
	MULSS X1, X0
	MOVSS X0, ret+16(FP)
	RET


Russ Cox

Jan 5, 2016, 6:51:48 AM
to Erwin Driessens, golang-dev, Keith Randall, zhongw...@linaro.org
On Tue, Jan 5, 2016 at 6:20 AM, Erwin Driessens <snes...@gmail.com> wrote:
> Yesterday I needed the following unsupported SSE instructions to speed up a hotspot:
> LDDQU
> PSLLDQ
> PSRLDQ
> PMOVZXBD
> PMULLD
> PHADDD
> PADDD
> CVTDQ2PS
> I was able to get them to work using the BYTE directive, but it's so cumbersome, and the resulting code isn't as clear as it could be. Also, experimentation is much hindered by the 'cast in stone' defines. Maybe I am doing things wrong?

No, it really is that bad. Thanks for the list of instructions. I'll try to get those in for Go 1.6.

Russ

Donovan Hide

Jan 6, 2016, 7:07:46 PM
to Russ Cox, Erwin Driessens, golang-dev, Keith Randall, zhongw...@linaro.org
Please also consider POPCNT, TZCNT and LZCNT. 


Many thanks!

ilya....@intel.com

Jan 11, 2016, 10:42:22 AM
to golang-dev, snes...@gmail.com, k...@google.com
I can do this, if Russ hasn't started.
If there are other AMD64 instructions that you need in the future, I or someone else from the Intel team will be happy to add them.

BTW Russ,
I already have some new instructions waiting for the code unfreeze. Should I send them now?

On Tuesday, January 5, 2016 at 14:51:48 UTC+3, rsc wrote:

Russ Cox

Jan 11, 2016, 12:13:47 PM
to Tocar, Ilya, golang-dev, Erwin Driessens, Keith Randall
On Mon, Jan 11, 2016 at 10:42 AM, <ilya....@intel.com> wrote:
> I can do this, if Russ hasn't started.
> If there are other AMD64 instructions that you need in the future, I or someone else from the Intel team will be happy to add them.

I have started. I'll take care of these. There is more general cleanup and testing that needs to be done at the same time.
 
> BTW Russ,
> I already have some new instructions waiting for the code unfreeze. Should I send them now?

Can you remind me what the CLs are?

Thanks.
Russ