With Go, we can use our multi-core CPUs quite effectively by using goroutines. We write `go` followed by a function call, and that function gets executed concurrently, in parallel when cores are available. What we can’t do today is tell Go to use the full capacity of a single core by processing data in parallel using SIMD instructions.
This is what languages and frameworks like Mojo ( https://mojolang.org/ ), ISPC ( https://ispc.github.io/ispc.html ), CUDA, OpenCL, Vulkan and friends provide. For Go, this would let code stay idiomatic, readable and portable while still being efficient, because the programmer tells the compiler which code can process data in parallel. This could speed up data processing by up to 32 times for byte-sized elements on modern Intel CPUs.
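To make the "up to 32 times" figure concrete, here is a rough sketch of the arithmetic behind it, assuming 256-bit AVX2 registers and that the upper bound on speedup is simply the number of elements that fit in one register. The helper name `lanesPerRegister` is hypothetical and only used for illustration; real speedups also depend on memory bandwidth and the shape of the loop.

```go
package main

import "fmt"

// lanesPerRegister is a hypothetical helper: it returns how many elements
// of elemBits bits fit in a SIMD register of registerBits bits.
// This is only a theoretical upper bound on the per-instruction speedup.
func lanesPerRegister(registerBits, elemBits int) int {
	return registerBits / elemBits
}

func main() {
	// Assumption: AVX2 registers are 256 bits wide.
	fmt.Println("byte lanes on AVX2: ", lanesPerRegister(256, 8))  // 32
	fmt.Println("int32 lanes on AVX2:", lanesPerRegister(256, 32)) // 8
	// Assumption: WebAssembly SIMD registers are 128 bits wide.
	fmt.Println("byte lanes on wasm: ", lanesPerRegister(128, 8)) // 16
}
```

The same calculation gives 8 lanes for `int32`, which is the element type used in the example that follows.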
Go already has a logical place to add this capability. If we allow the `go` keyword to be followed by a `for` `range` loop, it would enable data parallelism without breaking existing Go code, while keeping the code logical, readable and maintainable. It would look like this:
```
func contains(data []int32, target int32) bool {
	go for _, v := range data {
		found := v == target
		if reduce.Any(found) {
			return true
		}
	}
	return false
}
```
The function above would be on average 8 times faster on an Intel CPU with AVX2, where one 256-bit register holds eight `int32` values, than this current Go version:
```
func contains(data []int32, target int32) bool {
	for _, v := range data {
		if v == target {
			return true
		}
	}
	return false
}
```
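To give an intuition for where the speedup comes from, the sketch below emulates in ordinary scalar Go how a data-parallel loop over `int32` values could be processed eight lanes at a time. It is only an illustration under assumptions, not the output of the proof-of-concept compiler: the names `lanes`, `anyLane` and `containsChunked` are hypothetical, and the scalar tail loop is one possible strategy (a real implementation might use a masked final iteration instead). On real hardware the inner per-lane loop would be a single vector compare and `anyLane` a single mask test.

```go
package main

import "fmt"

// lanes is an assumed vector width: eight int32 values per 256-bit register.
const lanes = 8

// anyLane is a hypothetical stand-in for a reduce.Any-style reduction:
// it reports whether at least one lane of the mask is set.
func anyLane(mask [lanes]bool) bool {
	for _, m := range mask {
		if m {
			return true
		}
	}
	return false
}

// containsChunked emulates the data-parallel search in plain Go.
// Each outer iteration handles one full chunk of `lanes` elements.
func containsChunked(data []int32, target int32) bool {
	i := 0
	for ; i+lanes <= len(data); i += lanes {
		var found [lanes]bool
		// On SIMD hardware this inner loop would be one vector comparison.
		for l := 0; l < lanes; l++ {
			found[l] = data[i+l] == target
		}
		if anyLane(found) {
			return true
		}
	}
	// Scalar tail for the remaining len(data) % lanes elements.
	for ; i < len(data); i++ {
		if data[i] == target {
			return true
		}
	}
	return false
}

func main() {
	data := []int32{4, 8, 15, 16, 23, 42, 7, 9, 11, 13}
	fmt.Println(containsChunked(data, 42)) // true: found in the first full chunk
	fmt.Println(containsChunked(data, 13)) // true: found in the scalar tail
	fmt.Println(containsChunked(data, 99)) // false
}
```

Written by hand, this chunking logic obscures the intent of the loop, which is the readability cost the proposed syntax is meant to avoid.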
To make it easier for people to understand this in practice, I built a proof of concept that lets you experiment using TinyGo and its playground. It mainly supports two targets, wasm and AVX2: you can run the wasm result live in the browser and inspect the generated assembly for AVX2. I also wrote a blog post that goes over the results here: https://bluebugs.github.io/blogs/spmd-results/ . The playground is here: https://gofor-tinygo.netlify.app/ .
The short version is that by expressing parallelism at the loop level and in a high-level form, the compiler can become quite competitive even with hand-written intrinsics and assembly: performance ranges from 80% to 100% of the best hand-written libraries, while you keep the readability and portability of Go. The compiler doesn’t get any slower, as this is a mechanical transformation, just like compiling normal Go code. It is also robust, since a `go for` loop can always be turned into SIMD code.