So, I was digging around in internal/asm a week ago for a few reasons. One thing I noticed was that the assembly code doesn't contain any AVX/NEON optimizations, and I realized it should be possible to write similarly performing Go code -- without using assembly.
The main performance gain in that code seems to come from two sources:
1. avoiding bounds checks,
2. unrolling the loop a few times.
It's possible to do both in Go; however, doing it by hand is error prone.
AxpyAssembly-32 = 117.6µ ± 1% // current assembly implementation
AxpyBasic-32 = 151.0µ ± 1% // naive implementation in go
AxpyUnsafe-32 = 93.06µ ± 0% // naive without bounds checks
AxpyBasicR4-32 = 109.7µ ± 1% // naive with unrolling
AxpyUnsafeR4-32 = 101.5µ ± 1% // naive without bounds checks and unrolling
AxpyPointerLoopR4-32 = 81.90µ ± 1% // using pointer iterators
It was interesting to see that a naive implementation without bounds checks ended up faster than the current assembly implementation, and there were better performing variants still. However, the best approach seemed to differ between amd64 and arm64 -- so it would be nice if someone could rerun these benchmarks locally on different machines.
Either way, it seems there are performance wins available with a simpler codebase. As previously mentioned, though, rewriting all of the functions by hand seems error prone.
Much of the code in there can be roughly summarized as "do an operation across multiple slices, with different strides". So, I tried writing a code generator that does exactly that, and nothing more: https://github.com/egonelbre/exp/blob/vec/vector/generate/example.go#L37. It can generate a bunch of variants of those loops. I'm sure the code can be simplified and there are bugs, but it should be sufficient for a proof of concept.
Of course, none of this would preclude future AVX or NEON improvements. And I suspect that, similarly, much of the "loop handling" could be reused and only the actual operations themselves would need to be different.