So, I was digging around in internal/asm a week ago for a few reasons. One thing I noticed was that the assembly code doesn't contain any AVX/NEON optimizations, and I realized it should be possible to write similarly performing Go code -- without using assembly.
The main performance gain in that code seems to come from two sources:
1. avoiding bounds checks,
2. unrolling the loop a few times.
It's possible to do both in Go; however, doing it by hand is error prone.
AxpyAssembly-32 = 117.6µ ± 1% // current assembly implementation
AxpyBasic-32 = 151.0µ ± 1% // naive implementation in go
AxpyUnsafe-32 = 93.06µ ± 0% // naive without bounds checks
AxpyBasicR4-32 = 109.7µ ± 1% // naive with unrolling
AxpyUnsafeR4-32 = 101.5µ ± 1% // naive without bounds checks and unrolling
AxpyPointerLoopR4-32 = 81.90µ ± 1% // using pointer iterators
It was interesting to see that a naive implementation without bounds checks ended up faster than the current assembly implementation, and there were better performing variants still. However, the best approach seemed to differ between amd64 and arm64 -- so it would be nice if someone could rerun these benchmarks locally on different machines.
Either way, it seems there are performance wins available with a simpler codebase. As previously mentioned, though, rewriting all of the functions by hand seems error prone.
Much of the code in there can be roughly summarized as "do an operation across multiple slices, with different strides". So, I tried writing a code generator that does exactly that, and nothing more: https://github.com/egonelbre/exp/blob/vec/vector/generate/example.go#L37. It can generate a bunch of variants of those loops. I'm sure the code can be simplified and there are bugs, but it should be sufficient for a proof of concept.
Of course, none of this would preclude future AVX or NEON improvements. And I suspect that, similarly, much of the "loop handling" could be reused and only the actual operations themselves would need to be different.