Writing SIMD ASM


Greg Lowe

Jan 20, 2022, 9:15:26 PM
to golang-nuts
Hi team,

I'm a recent Gopher, and have had great success over the past year developing an insurance modelling application. The tooling is great, thanks to the team for creating it.

1) SIMD Workflow

I've got hot functions in my application which do element-wise operations on float slices. Some are just element-wise addition and multiplication, some are slightly more complicated.

I'm currently deploying on AWS Lambda x86, which has AVX2 support (Xeon Haswell+), but I'm also experimenting with Arm64 (Graviton 2), and would like to do some benchmarking on Graviton 3 (only available on EC2 at the moment).

I've been experimenting with implementing the hot functions in Go's ASM dialect, and using some simple code generation, to handle all the repetition. Nothing fancy, not much more than string templating. The results have been pretty good, but the workflow is pretty slow.
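To make the "string templating" approach concrete, here is a minimal sketch of a generator that emits an unrolled AVX2 float32 add in Go's assembler syntax. The function name emitUnrolledAdd and the register roles (SI = x, DX = y, DI = dst) are assumptions for illustration, not the code from my application, and a real generator would also need to emit the TEXT header, the loop, and the tail handling.

```go
package main

import "fmt"

// emitUnrolledAdd returns Go assembler lines for an AVX2 float32 add,
// unrolled `unroll` times. Each step handles 8 float32 values (32 bytes).
// Register roles are assumptions for this sketch: SI = x, DX = y, DI = dst.
func emitUnrolledAdd(unroll int) []string {
	var lines []string
	for i := 0; i < unroll; i++ {
		off := i * 32 // byte offset of the i-th 256-bit vector
		lines = append(lines,
			fmt.Sprintf("\tVMOVUPS %d(SI), Y%d", off, i),        // load 8 floats from x
			fmt.Sprintf("\tVADDPS %d(DX), Y%d, Y%d", off, i, i), // add 8 floats from y
			fmt.Sprintf("\tVMOVUPS Y%d, %d(DI)", i, off),        // store 8 floats to dst
		)
	}
	return lines
}

func main() {
	// Print the body of a loop iteration unrolled four times.
	for _, l := range emitUnrolledAdd(4) {
		fmt.Println(l)
	}
}
```

Keeping the generator as plain Go means the repetition lives in one loop, but every new kernel still needs its own hand-written instruction sequence, which is the slow part of the workflow.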

As a side project I've been toying with the idea of writing a slightly more advanced tool, that could read a "SIMD kernel" written as a simple Go function with a specific form, and generate ASM implementations for it. No fancy optimisations, just loop unrolling and vector instructions.

For example:

import . "asmgen"

// Implementation in a generated .s file
func Foo(dst []float32, a float32, x, y []float32)

// AST used as input to ASM codegen
func kernelFoo(i int, dst []float32, a float32, x, y []float32) {
	dst[i] = min(a*x[i], y[i])
}
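As a rough illustration of the semantics such a tool would have to preserve, the sketch below applies the kernel to every index in pure Go. This is what a portable fallback for Foo might look like; the names fooGeneric and min32 are hypothetical, and clamping the loop to the shortest slice is an assumption about the intended behaviour.

```go
package main

import "fmt"

// min32 is a small helper so the sketch does not depend on the Go version.
func min32(a, b float32) float32 {
	if a < b {
		return a
	}
	return b
}

// kernelFoo is the per-element kernel from the example above.
func kernelFoo(i int, dst []float32, a float32, x, y []float32) {
	dst[i] = min32(a*x[i], y[i])
}

// fooGeneric is a hypothetical scalar fallback: it applies the kernel to
// each index, up to the length of the shortest slice (an assumption).
func fooGeneric(dst []float32, a float32, x, y []float32) {
	n := len(dst)
	if len(x) < n {
		n = len(x)
	}
	if len(y) < n {
		n = len(y)
	}
	for i := 0; i < n; i++ {
		kernelFoo(i, dst, a, x, y)
	}
}

func main() {
	x := []float32{1, 2, 3, 4}
	y := []float32{5, 5, 5, 5}
	dst := make([]float32, len(x))
	fooGeneric(dst, 2, x, y)
	fmt.Println(dst) // prints [2 4 5 5]
}
```

A generated ASM version could be checked against this loop in tests, which is useful because vector min can differ from the scalar version in how it treats NaN.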

In reality, I probably don't have the time to do that, but it does feel like something minimal that would actually cover most of my immediate use cases is not a huge amount of work.

I guess this is basically just a limited form of c2goasm. See: https://github.com/minio/c2goasm

So maybe I should just use that, however including big blobs of hex encoded ASM doesn't seem great either. See: https://github.com/apache/arrow/blob/master/go/parquet/internal/utils/min_max_neon_arm64.s

So apologies that this question is a bit vague and rambly. But the workflow for SIMD here is pretty slow, and it feels like there could be a better way to solve this. So I'm basically just reaching out to see if anyone else has been working on this, or thinking about it, or has ideas about better solutions.


2) Arm64 ASM Neon Instructions

One problem that's come up is that a bunch of ARM instructions aren't defined in Go's assembler. So it looks like I'm going to have to write some code to generate the hex for these. I can probably copy the approach used here: https://github.com/minio/asm2plan9s/blob/master/asm2plan9s_arm64.go

For example - I'm currently writing:

WORD $0x4E24D400 // fadd v0.4s, v0.4s, v4.4s

But would like to write:

VFADD V0.S4, V0.S4, V4.S4
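Generating the hex for one instruction is mostly bit packing. The sketch below covers only the 4 x float32 form of FADD (vector), using the A64 layout where Rm sits in bits 20:16, Rn in bits 9:5 and Rd in bits 4:0. The helper names encodeFADD4S and wordLine are hypothetical, and this is not how asm2plan9s does it (that tool shells out to a real assembler).

```go
package main

import "fmt"

// encodeFADD4S returns the A64 encoding of "fadd vRd.4s, vRn.4s, vRm.4s".
// The base value is the FADD (vector) opcode with Q=1 (128-bit) and sz=0
// (single precision); register numbers are masked to 5 bits.
func encodeFADD4S(rd, rn, rm uint32) uint32 {
	const base = 0x4E20D400
	return base | (rm&31)<<16 | (rn&31)<<5 | (rd & 31)
}

// wordLine formats the encoding as a Go assembler WORD directive, with the
// GNU-style mnemonic kept as a trailing comment.
func wordLine(rd, rn, rm uint32) string {
	return fmt.Sprintf("WORD $0x%08X // fadd v%d.4s, v%d.4s, v%d.4s",
		encodeFADD4S(rd, rn, rm), rd, rn, rm)
}

func main() {
	fmt.Println(wordLine(0, 0, 4)) // prints WORD $0x4E24D400 // fadd v0.4s, v0.4s, v4.4s
}
```

Each additional instruction needs its own base opcode and field layout from the Arm architecture reference, so this scales poorly compared with proper assembler support.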

I see there's an existing issue to add a bunch of Neon floating point instructions:
https://github.com/golang/go/issues/41092

I actually spent a while having a go at adding the instructions myself, but couldn't figure it out.

I also see that there is a proposal and an MR to refactor the Arm64 assembler.
https://github.com/golang/go/issues/44734

Is there any ongoing work there, or has that effort stalled?

Anyways, thanks for reading my big wall of text.

Cheers,
Greg.

eric...@arm.com

Jan 20, 2022, 9:57:45 PM
to golang-nuts
> I see there's an existing issue to add a bunch of Neon floating point instructions:
> https://github.com/golang/go/issues/41092
>
> I actually spent a while having a go at adding the instructions myself, but couldn't figure it out.
>
> I also see that there is a proposal and an MR to refactor the Arm64 assembler.
> https://github.com/golang/go/issues/44734
>
> Is there any ongoing work there, or has that effort stalled?

I'm working on these issues. The plan is to refactor the assembler first; the newly designed assembler should make it easier to add instructions. As you said, there are a large number of arm64 instructions (NEON, SVE) that are not supported, and we want to spend as little effort as possible on this.

This will take some time, although the code is already under review. If you want to use some of the instructions as soon as possible, please submit an issue, or use WORD as a workaround in the meantime. Thanks~

Sebastien Binet

Jan 21, 2022, 2:27:36 AM
to Greg Lowe, golang-nuts
FYI, for X86, there's also:

- https://github.com/mmcloughlin/avo

which takes a slightly lower level approach (by requiring people to use
a closer-to-x86-asm vocabulary).

avo could become the target of your SIMD tool once the kernel has been
parsed into an AST.

(and once avo has gained an ARM backend.)

-s

Greg Lowe

Jan 21, 2022, 4:48:13 PM
to golang-nuts
> If you want to use some of the instructions as soon as possible

I can wait, and work around it with hardcoded words in the meantime.

> I'm working on these issues.

Glad to know that this is tracking along. 

I'm looking forward to trying out SVE on Graviton 3.