noinline is 25% faster than inline on Apple M1?

eric...@arm.com

Jul 22, 2022, 5:52:05 AM

to golang-nuts

In this piece of code, the two test functions are identical, except that one is inlined and the other is not. However, the inlined version is about 25% slower than the non-inlined version on an Apple M1 chip. Why is that?

The code is here https://go.dev/play/p/0NkLMtTZtv4
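
The playground source is not reproduced in this thread. As a rough sketch of the kind of comparison being described, two functions with identical bodies can be made to differ only in inlining by using the `//go:noinline` directive. The names `sum` and `sumNoInline`, the loop body, and the input string are all hypothetical and are not the original code.

```go
package main

import (
	"fmt"
	"testing"
)

// sum is small, so the compiler will likely inline it at its call sites.
func sum(s string) int {
	n := 0
	for i := 0; i < len(s); i++ {
		n += int(s[i])
	}
	return n
}

// sumNoInline has the same body as sum, but the directive below
// tells the compiler not to inline it.
//
//go:noinline
func sumNoInline(s string) int {
	n := 0
	for i := 0; i < len(s); i++ {
		n += int(s[i])
	}
	return n
}

// sink keeps the results live so the calls are not optimized away.
var sink int

func main() {
	const input = "hello, gophers" // hypothetical input

	inl := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink += sum(input)
		}
	})
	noinl := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink += sumNoInline(input)
		}
	})

	fmt.Println("inline:  ", inl)
	fmt.Println("noinline:", noinl)
}
```

In a real project the pair would normally live in a `_test.go` file as `BenchmarkInline` and `BenchmarkNoInline` and be run with `go test -bench`; `testing.Benchmark` is used here only so the sketch runs as a standalone program.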

Michael Pratt

Jul 22, 2022, 1:01:00 PM

to eric...@arm.com, golang-nuts

I can reproduce similar behavior on linux-amd64:

$ perf stat ./example.com.test -test.bench=BenchmarkInline -test.benchtime=100000000x

goos: linux
goarch: amd64
pkg: example.com
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
BenchmarkInline-12 100000000 16.78 ns/op
PASS

Performance counter stats for './example.com.test -test.bench=BenchmarkInline -test.benchtime=100000000x':

1,691.95 msec task-clock:u # 1.004 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
352 page-faults:u # 208.044 /sec
6,732,752,072 cycles:u # 3.979 GHz
22,405,823,428 instructions:u # 3.33 insn per cycle
6,501,294,164 branches:u # 3.842 G/sec
149,596 branch-misses:u # 0.00% of all branches

1.684677260 seconds time elapsed

1.692474000 seconds user
0.004020000 seconds sys

$ perf stat ./example.com.test -test.bench=BenchmarkNoInline -test.benchtime=100000000x
goos: linux
goarch: amd64
pkg: example.com
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
BenchmarkNoInline-12 100000000 10.79 ns/op
PASS

Performance counter stats for './example.com.test -test.bench=BenchmarkNoInline -test.benchtime=100000000x':

1,091.71 msec task-clock:u # 1.005 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
363 page-faults:u # 332.505 /sec
4,490,159,750 cycles:u # 4.113 GHz
20,205,764,499 instructions:u # 4.50 insn per cycle
6,701,281,015 branches:u # 6.138 G/sec
586,073 branch-misses:u # 0.01% of all branches

1.086302272 seconds time elapsed

1.087710000 seconds user
0.008027000 seconds sys

The non-inlined version actually executes fewer instructions to run the same benchmark (about 20.2 billion versus 22.4 billion), which surprises me because, from a naive look at the disassembly, the inlined version seems much more compact.
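
For a per-iteration view, the counters above can be divided by the fixed iteration count from `-test.benchtime=100000000x`. A small sketch of that arithmetic follows; the helper name `perOp` is hypothetical, and the totals cover the whole process, including startup and the testing harness, so the results slightly overstate the cost of the benchmark loop itself.

```go
package main

import "fmt"

// perOp divides a whole-run counter total by the number of benchmark iterations.
func perOp(total, iters uint64) float64 {
	return float64(total) / float64(iters)
}

func main() {
	const iters = 100_000_000 // from -test.benchtime=100000000x

	// Totals copied from the two perf stat runs above.
	fmt.Printf("inline:   %.1f instructions/op, %.1f cycles/op\n",
		perOp(22_405_823_428, iters), perOp(6_732_752_072, iters))
	fmt.Printf("noinline: %.1f instructions/op, %.1f cycles/op\n",
		perOp(20_205_764_499, iters), perOp(4_490_159_750, iters))
}
```

That works out to roughly 224 instructions and 67 cycles per iteration for the inlined version, versus roughly 202 instructions and 45 cycles for the non-inlined one, which is consistent with the reported 16.78 ns/op and 10.79 ns/op at about 4 GHz.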

Kevin Chowski

Jul 22, 2022, 9:56:54 PM

to golang-nuts

Datapoint: same with windows/amd64 on Intel (running 1.19beta1):

goos: windows
goarch: amd64
pkg: common/sandbox
cpu: Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz
BenchmarkNoInline-4 77425848 14.34 ns/op
BenchmarkInline-4 59108932 20.58 ns/op
PASS
ok common/sandbox 2.645s

Looking at the disassembly, I noticed that in the Inline case there is a 7-byte `lea 0xXXXXXX(%rip),%rbx` in the tight inner loop, which I hypothesize is due to some really aggressive constant propagation. If you manually defeat the propagation by storing the string in a global and manually copying it onto the stack, the inlined version becomes faster than NoInline again: https://go.dev/play/p/VRgJP2y7joS

goos: windows
goarch: amd64
pkg: common/sandbox
cpu: Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz
BenchmarkNoInline-4 81436539 14.08 ns/op
BenchmarkInline-4 59255162 21.32 ns/op
BenchmarkInlineDefeatConstProp-4 97524828 12.57 ns/op
PASS
ok common/sandbox 5.111s
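
The playground link has the actual code. As a minimal sketch of the trick described above, assuming a hypothetical helper `countByte` and a hypothetical global `pattern` (neither is taken from the original benchmark): the string is kept in a package-level variable and copied into a local before the loop, which is intended to stop the compiler from treating it as a compile-time constant at the inlined call site.

```go
package main

import "fmt"

// pattern is a package-level variable rather than a constant or a literal
// at the call site, so its value is loaded at run time.
var pattern = "abcdefghijklmnopqrstuvwxyz"

// countByte returns how many times c occurs in s.
func countByte(s string, c byte) int {
	n := 0
	for i := 0; i < len(s); i++ {
		if s[i] == c {
			n++
		}
	}
	return n
}

// run copies the global into a local once, then calls countByte in a loop.
func run(iters int) int {
	s := pattern // local copy of the string header, made before the loop
	total := 0
	for i := 0; i < iters; i++ {
		total += countByte(s, 'e')
	}
	return total
}

func main() {
	fmt.Println(run(1000))
}
```

Whether this changes the generated code depends on the compiler version, so the disassembly is worth checking before relying on it.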

Kevin Chowski

Jul 22, 2022, 10:18:50 PM

to golang-nuts

Sorry for the double-post. I just realized that the version I posted before was the manually-inlined version that I wrote as part of testing. For completeness, here's the non-manually-inlined version, which seems to have the same performance characteristics (and probably exactly the same machine code, though I didn't double-check): https://go.dev/play/p/h1K38Bq7Otv

Taj Khattra

Jul 22, 2022, 10:33:47 PM

to golang-nuts

I get similar results with 1.18 (inline slower than noinline), but different results with 1.16, 1.17, and 1.19rc2 (inline faster than noinline).

goos: linux
goarch: amd64
cpu: AMD Ryzen 5 5600X 6-Core Processor

======== 1.16.15
BenchmarkNoInline-12 125717362 9.607 ns/op
BenchmarkInline-12 150066394 8.721 ns/op

BenchmarkNoInline-12 125476344 9.710 ns/op
BenchmarkInline-12 133781608 8.851 ns/op

======== 1.17.10
BenchmarkNoInline-12 100000000 10.14 ns/op
BenchmarkInline-12 135818722 8.646 ns/op

BenchmarkNoInline-12 123817206 10.61 ns/op
BenchmarkInline-12 137691572 8.754 ns/op

======== 1.18.4
BenchmarkNoInline-12 121646458 10.13 ns/op
BenchmarkInline-12 81420973 14.65 ns/op

BenchmarkNoInline-12 123927972 10.05 ns/op
BenchmarkInline-12 81371038 14.64 ns/op

======== 1.19rc2
BenchmarkNoInline-12 120799062 9.864 ns/op
BenchmarkInline-12 147306990 8.579 ns/op

BenchmarkNoInline-12 120426837 10.17 ns/op
BenchmarkInline-12 129029052 8.621 ns/op

Keith Randall

Jul 23, 2022, 12:56:48 PM

to golang-nuts

Yes, I think the cause is the extra LEAQ that appears in the loop. Ideally it would be hoisted out of the loop. I think that is https://github.com/golang/go/issues/15808
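
As a source-level analogy for that hoisting (not the compiler's actual transformation; `table`, `sumInLoop`, and `sumHoisted` are hypothetical names): the address of static data does not change between iterations, so it can be computed once before the loop instead of on every pass.

```go
package main

import "fmt"

// table stands in for static data whose address is loop-invariant.
var table = [4]int{10, 20, 30, 40}

// sumInLoop refers to the global directly inside the loop body.
func sumInLoop(n int) int {
	total := 0
	for i := 0; i < n; i++ {
		total += table[i&3]
	}
	return total
}

// sumHoisted takes the address once, before the loop, and indexes
// through the local pointer inside the loop.
func sumHoisted(n int) int {
	t := &table
	total := 0
	for i := 0; i < n; i++ {
		total += t[i&3]
	}
	return total
}

func main() {
	fmt.Println(sumInLoop(8), sumHoisted(8))
}
```

Both functions return the same result. Whether they compile to different machine code depends on the compiler version; `go build -gcflags=-S` prints the generated assembly for inspection.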
