[RFC] SSA-level SLP autovectorization pass for the Go compiler

Arseny Samoylov

May 26, 2026, 3:24:08 PM

to golang-dev

Hi,
I've been working on an SSA-level SLP (Superword Level Parallelism) autovectorization pass for the Go compiler and would appreciate feedback on whether this direction seems worth pursuing upstream.

Thanks to the ongoing work on GOEXPERIMENT=simd, the current prototype operates on architecture-independent SSA before lowering. I primarily develop and test it on arm64, but it also works on amd64 (I expect it to work on wasm as well, but I haven't checked that yet).

The implementation is intentionally conservative for now:

basic-block local
no shuffles/permutations
no reductions
no profitability/cost model
fixed-width (128-bit) vectors only

At the moment the prototype successfully vectorizes several simple arithmetic kernels on amd64 and arm64.

As a simple example, consider:
```
type Vec struct {
	E [4]float64
}

func (v Vec) Dot(v2 Vec) float64 {
	return v.E[0]*v2.E[0] + v.E[1]*v2.E[1] + v.E[2]*v2.E[2] + v.E[3]*v2.E[3]
}
```

is vectorized to:

4 vector loads
2 vector multiplies
scalar extraction for the final reduction (the reduction itself is not currently vectorized)
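
For readers less familiar with SLP, the shape of that result can be emulated in plain Go. This is only a sketch under assumptions: the `lane2` type and the `mulLanes` and `dotPacked` names are hypothetical stand-ins for a 128-bit register holding two float64 lanes, not the SSA that the pass actually emits.

```go
package main

import "fmt"

type Vec struct {
	E [4]float64
}

// Dot is the scalar method from the example above.
func (v Vec) Dot(v2 Vec) float64 {
	return v.E[0]*v2.E[0] + v.E[1]*v2.E[1] + v.E[2]*v2.E[2] + v.E[3]*v2.E[3]
}

// lane2 is a hypothetical stand-in for one 128-bit vector of two float64 lanes.
type lane2 [2]float64

// mulLanes models a single vector multiply: both lanes are multiplied together.
func mulLanes(a, b lane2) lane2 {
	return lane2{a[0] * b[0], a[1] * b[1]}
}

// dotPacked mirrors the described result: 4 vector loads, 2 vector
// multiplies, then scalar extraction for the unvectorized reduction.
func dotPacked(v, v2 Vec) float64 {
	aLo := lane2{v.E[0], v.E[1]}   // vector load 1
	aHi := lane2{v.E[2], v.E[3]}   // vector load 2
	bLo := lane2{v2.E[0], v2.E[1]} // vector load 3
	bHi := lane2{v2.E[2], v2.E[3]} // vector load 4

	pLo := mulLanes(aLo, bLo) // vector multiply 1
	pHi := mulLanes(aHi, bHi) // vector multiply 2

	// Lanes are extracted and summed in the same order as the scalar code.
	return pLo[0] + pLo[1] + pHi[0] + pHi[1]
}

func main() {
	a := Vec{E: [4]float64{1, 2, 3, 4}}
	b := Vec{E: [4]float64{5, 6, 7, 8}}
	fmt.Println(a.Dot(b), dotPacked(a, b))
}
```

Because the reduction stays scalar and keeps the original left-to-right order, the packed form performs the same additions as the scalar one; only the multiplies are grouped.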

Even with these limitations, the prototype already identifies several hundred vectorization regions in real codebases:

runtime tests
```
# In the GOROOT/src
GOEXPERIMENT=slp go test -a -c -o a.out -gcflags=all=-d=ssa/slp/debug=2 runtime/ &> runtime_build.log
grep "Commit point" runtime_build.log | wc -l
225
```

etcd
```
GOEXPERIMENT=slp go build -a -o a.out -gcflags=all=-d=ssa/slp/debug=2 &> etcd_build.log
grep "Commit point" etcd_build.log | wc -l
738
```

cmd/compile itself
```
# In the GOROOT/src
GOEXPERIMENT=slp go build -a -o a.out -gcflags=all=-d=ssa/slp/debug=2 cmd/compile/ &> compile_build.log
grep "Commit point" compile_build.log | wc -l
3407
```

For the compile-time overhead I plan to evaluate the impact using the sweet/go-build benchmark suite.

Testing currently relies on test/codegen assembly and compiler output checks. There are also some test wrappers for semantic correctness validation. I also plan to add SSA-level tests.

Since this is a big change, I would appreciate feedback on whether this direction aligns with the current optimization direction of the Go compiler.

This work is also part of my academic research, so I intend to continue iterating on correctness, benchmarks, and design. Early feedback would help me align that work with the compiler's needs and priorities.

I would also be glad to prepare a more detailed design document if this direction appears promising.

My current plans for the first prototype:

Write more extensive tests (the current ones are rather shallow)
Write synthetic benchmarks for the compiler test suite
Enable other vector widths
Enable scalar-to-vector packing support
- Currently, only vector-to-scalar unpacking is supported.

Vectorize slice accesses with non-constant indices.
- For example: load s[0], load s[1] can currently be vectorized, while load s[i], load s[i+1] cannot.
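
To make the distinction concrete, the two source patterns look roughly like this. The function names are hypothetical, and the comments only restate the claim above about which form the prototype currently handles.

```go
package main

import "fmt"

// sumConst reads adjacent elements at constant indices (s[0], s[1]),
// the form described above as currently vectorizable.
func sumConst(a, b []float64) (float64, float64) {
	return a[0] + b[0], a[1] + b[1]
}

// sumVar reads the same kind of adjacent pair at a variable index
// (s[i], s[i+1]), the form described above as not yet vectorized.
func sumVar(a, b []float64, i int) (float64, float64) {
	return a[i] + b[i], a[i+1] + b[i+1]
}

func main() {
	a := []float64{1, 2, 3}
	b := []float64{10, 20, 30}
	fmt.Println(sumConst(a, b))
	fmt.Println(sumVar(a, b, 1))
}
```

Both functions touch two adjacent elements per slice; the only difference is whether the offsets are known constants or depend on `i`.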

Future plans:

More aggressively vectorize Load/Store.
- The current implementation is very conservative about combining operations that affect (or are affected by) memory state
Cross block vectorization
Vectorize reductions.

Here is the CL with the current implementation for reference.

Thanks,
Arseny Samoylov

Goran

May 26, 2026, 4:35:22 PM

to golang-dev

Thanks for working on this — SLP comes up a lot and it's good to have a concrete prototype to look at.

> no profitability/cost model

> already identifies several hundred vectorization regions

The region count isn't really the number that matters; what matters is whether it's a net win after regalloc and lowering. Without a cost model SLP regresses pretty easily — unaligned loads, the insert/extract around the scalar reduction, extra vector pressure — and a bb-local packer is exactly where you have the least info to tell. So the first thing I'd want to see is benchstat, not region counts: a bent run (or at least the go1/x benchmarks) on amd64 and arm64, GOEXPERIMENT=slp vs off, -count high enough that the p-values mean something, run under perflock. And please call out the worst per-benchmark regression, not just the geomean — a +N% geomean hiding a -20% on one hot loop is the case everyone worries about.
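
The point about a geomean hiding a single bad regression can be shown in a few lines of Go. The timings below are made-up numbers chosen only to show the effect, and `summarize` is a hypothetical helper, not part of benchstat. It assumes a non-empty input.

```go
package main

import (
	"fmt"
	"math"
)

// result holds illustrative (made-up) timings for one benchmark, in seconds.
type result struct {
	name      string
	base, slp float64
}

// summarize returns the geometric mean of the slp/base time ratios and the
// benchmark with the largest ratio, i.e. the worst slowdown.
func summarize(rs []result) (geomean float64, worst result) {
	logSum := 0.0
	worstRatio := math.Inf(-1)
	for _, r := range rs {
		ratio := r.slp / r.base
		logSum += math.Log(ratio)
		if ratio > worstRatio {
			worstRatio, worst = ratio, r
		}
	}
	return math.Exp(logSum / float64(len(rs))), worst
}

func main() {
	rs := []result{
		{"A", 1.00, 0.80},       // 20% less time
		{"B", 1.00, 0.80},       // 20% less time
		{"C", 1.00, 0.80},       // 20% less time
		{"HotLoop", 1.00, 1.25}, // 25% more time
	}
	g, w := summarize(rs)
	fmt.Printf("geomean ratio %.3f, worst %s (%+.1f%%)\n",
		g, w.name, (w.slp/w.base-1)*100)
}
```

With these inputs the geomean ratio is below 1, which reads as an overall improvement, even though one benchmark takes 25% longer. Reporting the worst entry next to the geomean keeps that case visible.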

Separately, compile time. The compiler is very sensitive here and SLP seed-finding/tree-building isn't free. compilebench against a toolstash baseline (-compile $(toolstash -n compile)) would give the per-package delta; the tail matters more than the average, since the pass will do the most work on the packages with the biggest functions. sweet/go-build is fine for the macro number, but I'd want the compilebench breakdown too.

And I'd pull the cost model forward in the roadmap. Right now it reads as "later," but it's the part that decides whether the rest is worth doing, and it's hard to evaluate the pass without it.
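
As a rough illustration of what even a very small profitability check might look like, here is a toy sketch in Go. Everything in it is an assumption made for illustration: the `costs`, `group`, and `profitable` names are hypothetical, the cost values are placeholders rather than measurements, and nothing here reflects the code in the CL.

```go
package main

import "fmt"

// costs holds per-operation cost estimates. The values used in main are
// placeholders for illustration, not measurements of any real CPU.
type costs struct {
	scalarOp int // one scalar arithmetic op
	vectorOp int // one vector arithmetic op
	extract  int // moving one lane out to a scalar register
}

// group describes one candidate pack: lanes isomorphic scalar ops that
// could become a single vector op, with extracts lanes still needed by
// scalar users afterwards.
type group struct {
	lanes    int
	extracts int
}

// profitable reports whether the vector form is estimated to be strictly
// cheaper than the scalar form, and returns the estimated saving.
func profitable(g group, c costs) (bool, int) {
	scalar := g.lanes * c.scalarOp
	vector := c.vectorOp + g.extracts*c.extract
	return vector < scalar, scalar - vector
}

func main() {
	c := costs{scalarOp: 1, vectorOp: 1, extract: 1}

	// Result stays in vector form: one vector op replaces two scalar ops.
	fmt.Println(profitable(group{lanes: 2, extracts: 0}, c))

	// Every lane is extracted again: the extracts outweigh the saving.
	fmt.Println(profitable(group{lanes: 2, extracts: 2}, c))
}
```

The second case is the insert/extract situation mentioned above: the arithmetic gets cheaper, but moving lanes back to scalar registers costs more than was saved.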

Will take a look at the CL.

Arseny Samoylov

May 27, 2026, 11:46:31 AM

to golang-dev

(Not sure where my reply is, so I'll duplicate it just in case it got lost)

> The region count isn't really the number that matters

Agreed. I used it mainly to demonstrate that the current version is functional and successfully vectorizes code -- for example, the compiler is able to bootstrap itself with vectorization enabled.

> Without a cost model SLP regresses pretty easily

Agreed. For that reason, the current implementation is intentionally conservative: no shuffles, no scalar-to-vector transforms, and very limited vector-to-scalar transforms.

For reference, see the comments on the analysis stage (link) and validation stage (link).

> So the first thing I'd want to see is benchstat

The end goal is definitely performance improvement. However, for an initial implementation, I think there are several other important concerns to address first:

Correctness -- preserving existing program semantics.
Portability -- ensuring generated code remains sufficiently portable.
Maintainability -- keeping the implementation simple, comprehensible, and robust.

Ideally, I see this happening in two stages: first, establishing groundwork with a simple and reliable implementation that prioritizes stability and clarity; second, iterating toward performance improvements.

> Separately, compile time. The compiler is very sensitive here and SLP seed-finding/tree-building isn't free.

Agreed.

For the current prototype, compile-time overhead has not been a primary focus. In a few places, I intentionally prioritized development speed and simplicity over performance, using more brute-force approaches to validate the design and iterate faster. There are several clearly suboptimal areas that would need optimization before considering production use.

> compilebench against a toolstash baseline (-compile $(toolstash -n compile)) would give the per-package delta

That is interesting — I had not come across this before, and I should definitely try it.

> the tail matters more than the average, since the pass will do the most work on the packages with the biggest functions

Agreed -- I have already run into this (see my comment here).

> sweet/go-build is fine for the macro number

Interestingly, sweet/go-build currently shows almost no degradation. That seems better than expected, so I want to investigate it further.

> And I'd pull the cost model forward in the roadmap. Right now it reads as "later," but it's the part that decides whether the rest is worth doing, and it's hard to evaluate the pass without it.

I agree that the cost model is important. My thinking is that, with the current conservative approach, it may not yet be strictly necessary for an initial version to have a cost model, though it should likely move earlier in the roadmap.

Arseny Samoylov

May 27, 2026, 11:46:39 AM

to golang-dev

> The region count isn't really the number that matters;

Agreed. I used it mainly to demonstrate that the current version is functional and successfully vectorizes code -- for example, the compiler is able to bootstrap itself with vectorization enabled.

> Without a cost model SLP regresses pretty easily

Agreed. For that reason, the current implementation is intentionally conservative: no shuffles, no scalar-to-vec transforms, and very limited vec-to-scalar transforms.

For reference, see the comments on the analysis stage (link) and validation stage (link).

> So the first thing I'd want to see is benchstat

The end goal is definitely performance improvement. However, for an initial implementation, I think there are several other important concerns to address first:

Correctness -- preserving existing program semantics.
Portability -- ensuring generated code remains sufficiently portable.

Maintainability -- keeping the implementation simple, comprehensible, and robust.

Ideally, I see this happening in two stages: first, establishing groundwork with a simple and reliable implementation that prioritizes stability and clarity; second, iterating toward performance improvements.

> Separately, compile time. The compiler is very sensitive here and SLP seed-finding/tree-building isn't free.

Agreed.

For the current prototype, compile time overhead has not been a primary focus. In a few places, I intentionally prioritized development speed and simplicity over performance, using more brute-force approaches to validate the design and iterate faster. There are several clearly suboptimal places that would need optimization.

> compilebench against a toolstash baseline (-compile $(toolstash -n compile)) would give the per-package delta;

That's interesting, I had not come across this before, so I definitely should try it.

> the tail matters more than the average, since the pass will do the most work on the packages with the biggest functions.

Agreed -- I have already run into this (see my comment here).

> sweet/go-build is fine for the macro number,

Interestingly, sweet/go-build currently shows almost no degradation. That seems better than expected, so I want to investigate it further.

> And I'd pull the cost model forward in the roadmap. Right now it reads as "later," but it's the part that decides whether the rest is worth doing, and it's hard to evaluate the pass without it.

I agree that the cost model is important. My thinking is that, with the current conservative approach, it may not yet be strictly necessary for an initial version to have a cost model, though I agree it should likely move earlier in the roadmap.

Arseny Samoylov

May 27, 2026, 1:59:32 PM

to Goran, golang-dev

(Not sure where my reply is, and this is the third time I am trying to send it, this time using Gmail instead of the Groups web interface)

> The region count isn't really the number that matters

Agreed. I used it mainly to demonstrate that the current version is functional and successfully vectorizes code -- for example, the compiler is able to bootstrap itself with vectorization enabled.

> Without a cost model SLP regresses pretty easily

Agreed. For that reason, the current implementation is intentionally conservative: no shuffles, no scalar-to-vector transforms, and very limited vector-to-scalar transforms.
For reference, see the comments on the analysis stage (link) and validation stage (link).

> So the first thing I'd want to see is benchstat

The end goal is definitely performance improvement. However, for an initial implementation, I think there are several other important concerns to address first:

Correctness -- preserving existing program semantics.

Portability -- ensuring generated code remains sufficiently portable.

Maintainability -- keeping the implementation simple, comprehensible, and robust.

Ideally, I see this happening in two stages: first, establishing groundwork with a simple and reliable implementation that prioritizes stability and clarity; second, iterating toward performance improvements.

> Separately, compile time. The compiler is very sensitive here and SLP seed-finding/tree-building isn't free.

Agreed.

For the current prototype, compile-time overhead has not been a primary focus. In a few places, I intentionally prioritized development speed and simplicity over performance, using more brute-force approaches to validate the design and iterate faster. There are several clearly suboptimal areas that would need optimization before considering production use.

> compilebench against a toolstash baseline (-compile $(toolstash -n compile)) would give the per-package delta

That is interesting — I had not come across this before, and I should definitely try it.

> the tail matters more than the average, since the pass will do the most work on the packages with the biggest functions

Agreed -- I have already run into this (see my comment here).

> sweet/go-build is fine for the macro number

Interestingly, sweet/go-build currently shows almost no degradation. That seems better than expected, so I want to investigate it further.

> And I'd pull the cost model forward in the roadmap. Right now it reads as "later," but it's the part that decides whether the rest is worth doing, and it's hard to evaluate the pass without it.

I agree that the cost model is important. My thinking is that, with the current conservative approach, it may not yet be strictly necessary for an initial version to have a cost model, though it should likely move earlier in the roadmap.


Arseny Samoylov

Jun 4, 2026, 4:57:30 AM

to Goran, golang-dev

Hi,

Here is a small update with some preliminary sweet benchmark results.

Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
                       │ base.results │             slp.results             │
                       │    sec/op    │    sec/op     vs base               │
BleveIndexBatch100-4      5.867 ±  2%    5.842 ±  2%       ~ (p=0.971 n=10)
ESBuildThreeJS-4         685.0m ±  2%   686.9m ±  1%       ~ (p=0.853 n=10)
ESBuildRomeTS-4          179.6m ±  1%   180.2m ±  1%       ~ (p=0.353 n=10)
EtcdPut-4                138.7m ± 37%   134.6m ± 19%       ~ (p=0.315 n=10)
EtcdSTM-4                526.0m ±  4%   531.9m ±  2%       ~ (p=0.579 n=10)
GoBuildKubelet-4          145.5 ±  1%    146.5 ±  0%  +0.73% (p=0.000 n=10)
GoBuildKubeletLink-4      9.647 ±  0%    9.670 ±  0%       ~ (p=0.739 n=10)
GoBuildIstioctl-4         117.4 ±  1%    118.3 ±  0%  +0.80% (p=0.000 n=10)
GoBuildIstioctlLink-4     10.13 ±  0%    10.09 ±  0%  -0.41% (p=0.019 n=10)
GoBuildFrontend-4         41.29 ±  1%    41.80 ±  0%  +1.22% (p=0.000 n=10)
GoBuildFrontendLink-4     1.613 ±  2%    1.619 ±  1%       ~ (p=0.971 n=10)
GoBuildTsgo-4             66.31 ±  1%    66.02 ±  1%       ~ (p=0.052 n=10)
GoBuildTsgoLink-4        842.0m ±  1%   838.3m ±  1%       ~ (p=0.089 n=10)
GopherLuaKNucleotide-4    27.62 ±  1%    27.63 ±  1%       ~ (p=0.739 n=10)
MarkdownRenderXHTML-4    237.6m ±  1%   239.7m ±  1%  +0.89% (p=0.001 n=10)
Tile38QueryLoad-4        532.3µ ±  0%   523.3µ ±  0%  -1.69% (p=0.000 n=10)
geomean                   2.391          2.390        -0.03%

HiSilicon Kunpeng-920
                       │ base.results │            slp.results             │
                       │    sec/op    │   sec/op     vs base               │
BleveIndexBatch100-4       7.467 ± 1%    7.513 ± 2%       ~ (p=0.165 n=10)
ESBuildThreeJS-4          755.1m ± 1%   754.5m ± 1%       ~ (p=0.912 n=10)
ESBuildRomeTS-4           195.5m ± 2%   194.0m ± 1%       ~ (p=0.123 n=10)
EtcdPut-4                 55.08m ± 1%   54.69m ± 1%       ~ (p=0.165 n=10)
EtcdSTM-4                 292.2m ± 1%   291.4m ± 1%       ~ (p=0.436 n=10)
GoBuildKubelet-4           157.7 ± 0%    158.7 ± 0%  +0.59% (p=0.000 n=10)
GoBuildKubeletLink-4       12.54 ± 2%    12.51 ± 1%       ~ (p=0.247 n=10)
GoBuildIstioctl-4          123.8 ± 0%    124.0 ± 0%  +0.17% (p=0.011 n=10)
GoBuildIstioctlLink-4      8.517 ± 1%    8.525 ± 0%       ~ (p=0.529 n=10)
GoBuildFrontend-4          45.04 ± 0%    45.55 ± 1%  +1.14% (p=0.000 n=10)
GoBuildFrontendLink-4      2.134 ± 1%    2.135 ± 1%       ~ (p=0.739 n=10)
GoBuildTsgo-4              75.66 ± 0%    75.74 ± 1%       ~ (p=0.796 n=10)
GoBuildTsgoLink-4          1.162 ± 1%    1.165 ± 1%       ~ (p=0.631 n=10)
GopherLuaKNucleotide-4     33.30 ± 3%    32.97 ± 1%       ~ (p=0.075 n=10)
MarkdownRenderXHTML-4     266.0m ± 0%   267.2m ± 0%  +0.45% (p=0.001 n=10)
Tile38QueryLoad-4         607.9µ ± 0%   608.0µ ± 0%       ~ (p=0.739 n=10)
geomean                    2.450         2.450       +0.03%

On both x86 and Arm64, the current prototype shows essentially no measurable change in overall performance.

There is a small degradation in the go-build benchmarks, which is expected, as the current implementation has not yet been optimized for compile-time overhead.

One result that stood out is Tile38QueryLoad on x86, which shows a small improvement.
I reran the benchmark separately to verify that the improvement persists, and the results appear stable across runs:

(rerun of Tile38QueryLoad benchmark on x86)
                  │ base.results │            slp.results             │
                  │    sec/op    │   sec/op     vs base               │
Tile38QueryLoad-4    531.3µ ± 0%   523.4µ ± 0%  -1.49% (p=0.000 n=10)

                  │  base.results   │              slp.results               │
                  │ p50-latency-sec │ p50-latency-sec  vs base               │
Tile38QueryLoad-4       251.3µ ± 0%       250.6µ ± 0%  -0.25% (p=0.005 n=10)

                  │  base.results   │              slp.results               │
                  │ p90-latency-sec │ p90-latency-sec  vs base               │
Tile38QueryLoad-4       862.6µ ± 0%       847.7µ ± 0%  -1.73% (p=0.000 n=10)

                  │  base.results   │              slp.results               │
                  │ p99-latency-sec │ p99-latency-sec  vs base               │
Tile38QueryLoad-4       4.998m ± 1%       4.803m ± 1%  -3.89% (p=0.000 n=10)

                  │ base.results │            slp.results             │
                  │    ops/s     │    ops/s     vs base               │
Tile38QueryLoad-4    5.646k ± 0%   5.731k ± 0%  +1.51% (p=0.000 n=10)

Here is how I ran the benchmarks:

taskset -c 44-47 ./sweet run -shell -work-dir `pwd`/tmp config.toml 2>&1 | tee sweet.log
# Separate rerun for tile38 on x86
taskset -c 44-47 ./sweet run -run=tile38 -shell -work-dir `pwd`/tmp config.toml 2>&1 | tee sweet.log

config.toml:

[[config]]
  name = "base"
  goroot = "/home/asamoylov/go-upstream"

[[config]]
  name = "slp"
  goroot = "/home/asamoylov/go-slp"
  envbuild = ["GOFLAGS=-d=ssa/slp/debug=2"]

Notes:

In my environment the cockroachdb benchmark fails for some reason (regardless of SLP)
For the go-build benchmark, I had to use a separate compiler build with SLP enabled by default (that's why the "slp" config has a different goroot and doesn't need GOEXPERIMENT=slp)
GOFLAGS=-d=ssa/slp/debug=2 is not particularly useful without `=all`, which it does not accept

Also, my previous reply sent through the Groups web interface appears to have gotten stuck and was only published after I replied via Gmail.
As a result, the thread now contains three copies of the same reply.

The Gmail version (the third one) does not contain proper links, so please refer to one of the first two copies.

I hope the benchmark tables will be displayed correctly. Pasting them as plain text looks rather messy.
