Question on 'cmd/compile: handle partially overlapping assignments' behavior on ARM64.


Josh Peterson

Sep 19, 2022, 10:01:37 PM
to golang-nuts

I stumbled across a performance regression on ARM64 in go 1.18.6 and 1.19.1 that wasn't present in earlier releases. In the following benchmark, `BenchmarkSliceOfArray` takes about 70% longer to run while `BenchmarkSliceOfInt` remains unchanged.


go1.19:   BenchmarkSliceOfArray 413121 2855 ns/op

go1.19.1: BenchmarkSliceOfArray 216478 4881 ns/op


Using `pprof` and `GOSSAFUNC go build`, I found that the latest releases call `runtime.memmove`, which accounts for the change in performance. I see neither the `memmove` call nor the performance degradation on AMD64.


I believe this can be traced back to https://go-review.googlesource.com/c/go/+/425234


Is this new behavior in this scenario correct or unintended?


Should I open an issue for this? 


I am a newbie, so forgive me if there are errors in my approach. 


----

go version go1.19 darwin/arm64

% GOMAXPROCS=1 go1.19 test -cpuprofile cpu.prof -bench .  

goos: darwin

goarch: arm64

pkg: foo/bar

BenchmarkSliceOfInt 572226 2066 ns/op

BenchmarkSliceOfArray 413121 2855 ns/op

PASS

ok foo/bar 3.488s

----

go version go1.19.1 darwin/arm64

% GOMAXPROCS=1 go test -cpuprofile cpu.prof -bench .   

goos: darwin

goarch: arm64

pkg: foo/bar

BenchmarkSliceOfInt 527084 2065 ns/op

BenchmarkSliceOfArray 216478 4881 ns/op

PASS

ok foo/bar 2.468s

ok foo/bar 3.538s

----

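package datamove

// Array is twice the size of an int (16 bytes on 64-bit platforms).
type Array [2]int

// moveSliceOfArrayData swaps adjacent elements from front to back,
// which moves the first element of s to the end.
func moveSliceOfArrayData(s []Array) []Array {
    for i := 1; i < len(s); i++ {
        s[i-1], s[i] = s[i], s[i-1]
    }
    return s
}

// moveSliceOfIntData does the same for a slice of int.
func moveSliceOfIntData(s []int) []int {
    for i := 1; i < len(s); i++ {
        s[i-1], s[i] = s[i], s[i-1]
    }
    return s
}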

----

package datamove

import (
    "testing"
)

// Package-level sinks keep the compiler from optimizing the calls away.
var resultInt []int
var resultArray []Array

func BenchmarkSliceOfInt(b *testing.B) {
    var r []int
    for n := 0; n < b.N; n++ {
        s := make([]int, 1000)
        r = moveSliceOfIntData(s)
    }
    resultInt = r
}

func BenchmarkSliceOfArray(b *testing.B) {
    var r []Array
    for n := 0; n < b.N; n++ {
        s := make([]Array, 1000)
        r = moveSliceOfArrayData(s)
    }
    resultArray = r
}

peterGo

Sep 20, 2022, 11:24:41 AM
to golang-nuts
I am unable to reproduce your results on AMD64. I don't see any regression.

Run your benchmarks with option -benchmem.

Version: devel go1.20-1eeb257b88 Tue Sep 20 02:58:09 2022 +0000
Sizeof(Array{}): 16
Sizeof(int(0)): 8
goarch: amd64
BenchmarkSliceOfInt-4    391646  2583 ns/op   8192 B/op  1 allocs/op
BenchmarkSliceOfArray-4  236977  4869 ns/op  16384 B/op  1 allocs/op

Version: go1.17.13
Sizeof(Array{}): 16
Sizeof(int(0)): 8
goarch: amd64
BenchmarkSliceOfInt-4    426343  2383 ns/op   8192 B/op  1 allocs/op
BenchmarkSliceOfArray-4  233817  4771 ns/op  16384 B/op  1 allocs/op

The difference is type int is 8 bytes and type [2]int (type Array) is 16 bytes.

Change the type of Array to 8 bytes (type [1]int).
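The `Sizeof` lines in these results can be produced with `unsafe.Sizeof`. The following is a minimal sketch; the type names `Array2` and `Array1` and the helper `sizes` are chosen here for illustration so both variants can be shown side by side.

```go
package main

import (
	"fmt"
	"unsafe"
)

type Array2 [2]int // the Array type from the original post
type Array1 [1]int // the 8-byte variant suggested above

// sizes returns the sizes in bytes of Array2, Array1, and int.
func sizes() (a2, a1, i uintptr) {
	return unsafe.Sizeof(Array2{}), unsafe.Sizeof(Array1{}), unsafe.Sizeof(int(0))
}

func main() {
	a2, a1, i := sizes()
	// On 64-bit platforms such as amd64 and arm64, int is 8 bytes.
	fmt.Printf("Sizeof(Array2{}): %d\n", a2)
	fmt.Printf("Sizeof(Array1{}): %d\n", a1)
	fmt.Printf("Sizeof(int(0)): %d\n", i)
}
```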

Version: devel go1.20-1eeb257b88 Tue Sep 20 02:58:09 2022 +0000
Sizeof(Array{}): 8
Sizeof(int(0)): 8
goarch: amd64
BenchmarkSliceOfInt-4    405793  2562 ns/op  8192 B/op  1 allocs/op
BenchmarkSliceOfArray-4  469999  2540 ns/op  8192 B/op  1 allocs/op

Version: go1.17.13
Sizeof(Array{}): 8
Sizeof(int(0)): 8
goarch: amd64
BenchmarkSliceOfInt-4    440340  2378 ns/op  8192 B/op  1 allocs/op
BenchmarkSliceOfArray-4  481071  2400 ns/op  8192 B/op  1 allocs/op


There is no significant difference.

peter

Brian Candler

Sep 20, 2022, 11:35:58 AM
to golang-nuts
On Tuesday, 20 September 2022 at 16:24:41 UTC+1 peterGo wrote:
> I am unable to reproduce your results on AMD64. I don't see any regression.

That's expected: the original post also said "I don't see [...] the performance degradation on AMD64". The report is specific to ARM64.

peterGo

Sep 20, 2022, 12:02:37 PM
to golang-nuts
Brian,

Yes, I read that too. A fundamental tenet of science is reproducibility. I was able to reproduce the AMD64 results and documented it. A step forward.

peter

Jan Mercl

Sep 20, 2022, 12:14:32 PM
to golang-nuts
On Tue, Sep 20, 2022 at 6:02 PM peterGo <go.pe...@gmail.com> wrote:

> Yes, I read that too. A fundamental tenet of science is reproducibility. I was able to reproduce the AMD64 results and documented it. A step forward.

5:24 PM (45 minutes ago) > I am unable to reproduce your results on AMD64.
6:02 PM (7 minutes ago) > I was able to reproduce the AMD64 results and documented it.

peterGo

Sep 20, 2022, 12:53:40 PM
to golang-nuts
Jan,

An obvious typo. Read my original message: "I am unable to reproduce your results on AMD64"

peter

Josh Peterson

Sep 20, 2022, 3:04:00 PM
to golang-nuts
Hi Peter,

Please note that devel go1.20-bcd44b61d3, 1.19.1, and 1.18.6 show times in the 4xxx ns/op range for BenchmarkSliceOfArray, while 1.19 and 1.18 through 1.18.5 show times in the 2xxx ns/op range.

As noted earlier in this thread, the regression is seen on ARM64 and not AMD64.

This time I have included results only for BenchmarkSliceOfArray. The results for BenchmarkSliceOfInt do not show a regression.

go version devel go1.20-bcd44b61d3 Tue Sep 20 16:21:31 2022 +0000 darwin/arm64
BenchmarkSliceOfArray     262800              4552 ns/op           16384 B/op          1 allocs/op
go version go1.19.1 darwin/arm64
BenchmarkSliceOfArray     229941              4607 ns/op           16384 B/op          1 allocs/op
go version go1.19 darwin/arm64
BenchmarkSliceOfArray     449023              2580 ns/op           16384 B/op          1 allocs/op
go version go1.18.6 darwin/arm64
BenchmarkSliceOfArray     239430              4377 ns/op           16384 B/op          1 allocs/op
go version go1.18.5 darwin/arm64
BenchmarkSliceOfArray     506563              2349 ns/op           16384 B/op          1 allocs/op
go version go1.18.4 darwin/arm64
BenchmarkSliceOfArray     508051              2362 ns/op           16384 B/op          1 allocs/op
go version go1.18.3 darwin/arm64
BenchmarkSliceOfArray     431294              2357 ns/op           16384 B/op          1 allocs/op
go version go1.18.2 darwin/arm64
BenchmarkSliceOfArray     425918              2715 ns/op           16384 B/op          1 allocs/op
go version go1.18.1 darwin/arm64
BenchmarkSliceOfArray     434580              2351 ns/op           16384 B/op          1 allocs/op
go version go1.18 darwin/arm64
BenchmarkSliceOfArray     415366              2482 ns/op           16384 B/op          1 allocs/op

Josh Peterson

Sep 21, 2022, 11:12:10 AM
to golang-nuts
I am going to open an issue for this regression on ARM64 unless there are objections or a view that this is the intended behavior. 

Josh