[RFC] Faster loop with data parallelism in Go

Cedric BAIL

Jul 1, 2026, 6:50:15 PM

to golang-nuts

With Go, we can use our multi-core CPUs quite effectively by using goroutines. We write `go` followed by a function call, and that function gets executed in parallel. What we can’t do today is tell Go to use the full capacity of one core by processing data in parallel using SIMD instructions.

This is what languages like Mojo ( https://mojolang.org/ ), ISPC ( https://ispc.github.io/ispc.html ), CUDA, OpenCL, Vulkan and friends provide. In the case of Go, this would let code stay idiomatic, readable and portable while still being efficient, because the compiler is told that this code can process data in parallel. This could speed up data processing by up to 32 times for bytes on a modern Intel CPU.

Go already has a logical place to add this capability. If we allow the `go` keyword to be followed by `for` `range`, it would enable data parallelism without breaking existing Go code, while keeping the code logical, readable and maintainable. It would look like this:

```
func contains(data []int32, target int32) bool {
	go for _, v := range data {
		found := v == target
		if reduce.Any(found) {
			return true
		}
	}
	return false
}
```

The function above will be on average 8 times faster on an Intel CPU with AVX2 than the current Go version:

```
func contains(data []int32, target int32) bool {
	for _, v := range data {
		if v == target {
			return true
		}
	}
	return false
}
```
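As a rough sanity check on the 8x and 32x figures, here is the lane arithmetic, assuming both numbers track how many elements fit in one 256-bit AVX2 register. The `lanes` helper is hypothetical and only illustrates the upper bound on elements processed per instruction; it is not a measurement.

```go
package main

import "fmt"

// lanes reports how many elements of elemBits width fit in one SIMD
// register of registerBits width. Hypothetical helper for illustration.
func lanes(registerBits, elemBits int) int {
	return registerBits / elemBits
}

func main() {
	const avx2Bits = 256 // width of one AVX2 (YMM) register

	fmt.Println("int32 lanes:", lanes(avx2Bits, 32)) // 8 elements per register
	fmt.Println("byte lanes:", lanes(avx2Bits, 8))   // 32 elements per register
}
```

Real speedups depend on memory bandwidth and on how early the loop exits, so these are ceilings rather than guarantees.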

To make it easier for people to understand this in practice, I built a proof of concept that lets you experiment using tinygo and its playground. I mostly enabled two targets, wasm and AVX2: for wasm you can explore the live result in the browser, and for AVX2 you can see the generated assembly. I also put together a blog post that goes over the results here: https://bluebugs.github.io/blogs/spmd-results/ . The playground is here: https://gofor-tinygo.netlify.app/ .

The short version is that by expressing parallelism at the loop level and in a high-level form, the compiler can become quite competitive even with hand-written intrinsics and assembly: performance ranges between 80% and 100% of the best hand-written library, while you keep the readability and portability of Go. The compiler doesn’t get any slower, as this is a mechanical transformation in the same way as for normal Go code. It is robust, as a `go for` can always become SIMD code.

Cedric

Axel Wagner

Jul 2, 2026, 2:13:39 AM

to Cedric BAIL, golang-nuts

Are you aware that Go has experimental support for SIMD intrinsics? I think it is unlikely that we would change the language to provide something that we already have intrinsics for.

To view this discussion visit https://groups.google.com/d/msgid/golang-nuts/b211707e-92eb-4c24-bc5a-a941940ab0b0n%40googlegroups.com.

Cedric BAIL

Jul 3, 2026, 12:21:19 AM

to Axel Wagner, golang-nuts

Yes, I am. For reference, this does exactly the same thing as my example: https://github.com/samber/lo/blob/master/exp/simd/intersect_avx512.go#L76

As you will note, SIMD intrinsics are lower level. Loops are not part of the SIMD intrinsics proposal, so you have to remember to do the loop peeling manually, for example (that little `for` at the end of the function). Additionally, as you can see, a lot of boilerplate is required because, at that lower level, the compiler has no understanding of what the code is trying to do. That is also why `unsafe` is used everywhere in that file.

This is not to say that intrinsics are not a necessary tool, just that you likely will not want to use them everywhere in your code base because of these constraints. If data parallelism were available in the language, it would likely be something we would reach for more often, with a lower barrier to entry.
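To make the loop peeling point concrete, here is a minimal sketch in portable Go of the shape that hand-vectorized code tends to take. It uses no real intrinsics: the inner loop over a block only stands in for one vector compare plus an "any" reduction, and the name `containsChunked` and the width of 8 lanes are assumptions for illustration, not code from the linked file.

```go
package main

import "fmt"

// lanes is the assumed vector width in int32 elements
// (8 for a 256-bit register). Illustrative only.
const lanes = 8

// containsChunked mimics the structure of hand-vectorized code in plain Go:
// a main loop over full blocks followed by a peeled scalar tail.
func containsChunked(data []int32, target int32) bool {
	i := 0
	// Main loop: one full block of `lanes` elements per iteration.
	for ; i+lanes <= len(data); i += lanes {
		block := data[i : i+lanes]
		found := false
		// Stands in for a single vector compare and an "any" reduction.
		for _, v := range block {
			found = found || v == target
		}
		if found {
			return true
		}
	}
	// Peeled tail: the remaining len(data)%lanes elements, one at a time.
	// Forgetting this loop silently skips the end of the slice.
	for ; i < len(data); i++ {
		if data[i] == target {
			return true
		}
	}
	return false
}

func main() {
	data := []int32{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	fmt.Println(containsChunked(data, 10)) // 10 sits in the tail, past the first block
	fmt.Println(containsChunked(data, 11)) // not present
}
```

With a `go for` loop, the compiler would be responsible for both the block loop and the tail, which is the bookkeeping this sketch spells out by hand.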

Cedric
