noinline is 25% faster than inline on Apple M1?

eric...@arm.com

Jul 22, 2022, 5:52:05 AM
to golang-nuts
In this piece of code, the two test functions are identical, except that one is inlined and the other is not. However, the inlined version is about 25% slower than the non-inlined version on an Apple M1 chip. Why is that?

The code is here https://go.dev/play/p/0NkLMtTZtv4
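
The playground code is not reproduced in this thread. As a rough sketch of the kind of setup being described, two functions with identical bodies, one of them marked `//go:noinline`, can be compared as below. The function names and the loop body are hypothetical stand-ins, not the code behind the link, and `testing.Benchmark` is used only so that the sketch runs as an ordinary program.

```go
package main

import (
	"fmt"
	"testing"
)

// countSpaces is a hypothetical workload. It is small enough that the
// compiler will probably inline it; `go build -gcflags=-m` reports the
// actual decision.
func countSpaces(s string) int {
	n := 0
	for i := 0; i < len(s); i++ {
		if s[i] == ' ' {
			n++
		}
	}
	return n
}

// countSpacesNoInline has the same body, but the directive below tells
// the compiler never to inline it into its callers.
//
//go:noinline
func countSpacesNoInline(s string) int {
	n := 0
	for i := 0; i < len(s); i++ {
		if s[i] == ' ' {
			n++
		}
	}
	return n
}

// sink keeps the results live so the calls are not optimized away.
var sink int

func main() {
	const input = "the quick brown fox jumps over the lazy dog"

	inlined := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink += countSpaces(input)
		}
	})
	notInlined := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink += countSpacesNoInline(input)
		}
	})

	fmt.Println("inline:  ", inlined)
	fmt.Println("noinline:", notInlined)
}
```

Which variant is faster will depend on the Go version and the CPU, so the numbers from this sketch should not be expected to match the ones reported in this thread.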

Michael Pratt

Jul 22, 2022, 1:01:00 PM
to eric...@arm.com, golang-nuts
I can reproduce similar behavior on linux-amd64:

$ perf stat ./example.com.test -test.bench=BenchmarkInline -test.benchtime=100000000x
goos: linux                              
goarch: amd64                                              
pkg: example.com
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz    
BenchmarkInline-12      100000000               16.78 ns/op                                                                                  
PASS
                                                                     
 Performance counter stats for './example.com.test -test.bench=BenchmarkInline -test.benchtime=100000000x':

          1,691.95 msec task-clock:u              #    1.004 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                    
                 0      cpu-migrations:u          #    0.000 /sec                    
               352      page-faults:u             #  208.044 /sec                    
     6,732,752,072      cycles:u                  #    3.979 GHz                    
    22,405,823,428      instructions:u            #    3.33  insn per cycle        
     6,501,294,164      branches:u                #    3.842 G/sec                  
           149,596      branch-misses:u           #    0.00% of all branches        

       1.684677260 seconds time elapsed

       1.692474000 seconds user
       0.004020000 seconds sys



$ perf stat ./example.com.test -test.bench=BenchmarkNoInline -test.benchtime=100000000x
goos: linux
goarch: amd64
pkg: example.com
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
BenchmarkNoInline-12            100000000               10.79 ns/op
PASS

 Performance counter stats for './example.com.test -test.bench=BenchmarkNoInline -test.benchtime=100000000x':

          1,091.71 msec task-clock:u              #    1.005 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                    
                 0      cpu-migrations:u          #    0.000 /sec                    
               363      page-faults:u             #  332.505 /sec                    
     4,490,159,750      cycles:u                  #    4.113 GHz                    
    20,205,764,499      instructions:u            #    4.50  insn per cycle        
     6,701,281,015      branches:u                #    6.138 G/sec                  
           586,073      branch-misses:u           #    0.01% of all branches        

       1.086302272 seconds time elapsed

       1.087710000 seconds user
       0.008027000 seconds sys

The non-inlined version actually executes fewer instructions to run the same benchmark, which surprises me, because from a naive look at the disassembly the inlined version seems much more compact.
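
To make the comparison concrete, the counters above can be reduced to per-iteration figures. The following is only a small arithmetic sketch: the `run` type and its method names are made up for illustration, and the inputs are copied from the two `perf stat` outputs. Because `perf stat` counts the whole process, the per-iteration values slightly overestimate the cost of the loop body itself.

```go
package main

import "fmt"

// run holds counters copied from one perf stat invocation above.
type run struct {
	name         string
	cycles       float64
	instructions float64
	iterations   float64 // from -test.benchtime=100000000x
}

// ipc returns the instructions retired per cycle.
func (r run) ipc() float64 { return r.instructions / r.cycles }

// instrPerOp returns the instructions per benchmark iteration,
// including a small share of process startup and harness overhead.
func (r run) instrPerOp() float64 { return r.instructions / r.iterations }

// cyclesPerOp returns the cycles per benchmark iteration.
func (r run) cyclesPerOp() float64 { return r.cycles / r.iterations }

var (
	inlineRun   = run{"Inline", 6732752072, 22405823428, 1e8}
	noInlineRun = run{"NoInline", 4490159750, 20205764499, 1e8}
)

func main() {
	for _, r := range []run{inlineRun, noInlineRun} {
		fmt.Printf("%-9s IPC=%.2f instr/op=%.1f cycles/op=%.1f\n",
			r.name, r.ipc(), r.instrPerOp(), r.cyclesPerOp())
	}
}
```

On these numbers the inlined build retires roughly 22 more instructions per iteration (about 224 versus 202) and also sustains a lower IPC (3.33 versus 4.50), so both effects contribute to the gap.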



Kevin Chowski

Jul 22, 2022, 9:56:54 PM
to golang-nuts
Datapoint: same with windows/amd64 on Intel (running 1.19beta1):

goos: windows
goarch: amd64
pkg: common/sandbox
cpu: Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz
BenchmarkNoInline-4     77425848                14.34 ns/op
BenchmarkInline-4       59108932                20.58 ns/op
PASS
ok      common/sandbox  2.645s

Looking at the disassembly, I noticed that in the Inline case there is a 7-byte `lea    0xXXXXXX(%rip),%rbx` in the tight inner loop, which I hypothesize is the result of some very aggressive constant propagation. If you manually defeat the propagation by storing the string in a global and copying it onto the stack, the inlined version becomes faster than NoInline again: https://go.dev/play/p/VRgJP2y7joS

goos: windows
goarch: amd64
pkg: common/sandbox
cpu: Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz
BenchmarkNoInline-4                     81436539                14.08 ns/op
BenchmarkInline-4                       59255162                21.32 ns/op
BenchmarkInlineDefeatConstProp-4        97524828                12.57 ns/op
PASS
ok      common/sandbox  5.111s
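
The general shape of that workaround can be sketched as follows. This is not the code from the playground links; `countMatches`, `vowels`, and the two wrappers are hypothetical names chosen for illustration. The idea is that a string literal passed to an inlined callee can be treated as a constant in the caller, whereas a value read from a package-level variable cannot. Whether the generated machine code actually differs depends on the compiler version.

```go
package main

import "fmt"

// countMatches is a hypothetical small function that the compiler may
// inline. It counts the bytes of s that also occur in set.
func countMatches(s, set string) int {
	n := 0
	for i := 0; i < len(s); i++ {
		for j := 0; j < len(set); j++ {
			if s[i] == set[j] {
				n++
				break
			}
		}
	}
	return n
}

// vowels is a package-level variable, not a constant, so its value is
// loaded at run time rather than folded into the caller.
var vowels = "aeiou"

// withConstant passes a literal; after inlining, the compiler knows the
// address and length of the string at compile time.
func withConstant(s string) int {
	return countMatches(s, "aeiou")
}

// withGlobalCopy copies the global into a local first, so the inlined
// body works on a stack copy of the string header.
func withGlobalCopy(s string) int {
	set := vowels
	return countMatches(s, set)
}

func main() {
	const input = "constant propagation"
	fmt.Println(withConstant(input), withGlobalCopy(input))
}
```

Both wrappers return the same result; only the way the second argument reaches the inner loop differs.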

Kevin Chowski

Jul 22, 2022, 10:18:50 PM
to golang-nuts
Sorry for the double post. I just realized that the version I posted before was the manually inlined one I had made as part of testing. For completeness, here's the non-manually-inlined version, which seems to have the same performance characteristics (and probably exactly the same machine code, though I didn't double-check): https://go.dev/play/p/h1K38Bq7Otv

Taj Khattra

Jul 22, 2022, 10:33:47 PM
to golang-nuts
I get similar results with 1.18 (inline slower than noinline),
but different results with 1.16, 1.17, and 1.19rc2 (inline faster than noinline):

goos: linux
goarch: amd64
cpu: AMD Ryzen 5 5600X 6-Core Processor

======== 1.16.15
BenchmarkNoInline-12        125717362            9.607 ns/op
BenchmarkInline-12          150066394            8.721 ns/op

BenchmarkNoInline-12        125476344            9.710 ns/op
BenchmarkInline-12          133781608            8.851 ns/op

======== 1.17.10
BenchmarkNoInline-12        100000000           10.14 ns/op
BenchmarkInline-12          135818722            8.646 ns/op

BenchmarkNoInline-12        123817206           10.61 ns/op
BenchmarkInline-12          137691572            8.754 ns/op

======== 1.18.4
BenchmarkNoInline-12        121646458           10.13 ns/op
BenchmarkInline-12          81420973            14.65 ns/op

BenchmarkNoInline-12        123927972           10.05 ns/op
BenchmarkInline-12          81371038            14.64 ns/op

======== 1.19rc2
BenchmarkNoInline-12        120799062            9.864 ns/op
BenchmarkInline-12          147306990            8.579 ns/op

BenchmarkNoInline-12        120426837           10.17 ns/op
BenchmarkInline-12          129029052            8.621 ns/op


Keith Randall

Jul 23, 2022, 12:56:48 PM
to golang-nuts
Yes, I think this is the extra LEAQ that appears in the loop. Ideally it would be hoisted out of the loop. I think that is https://github.com/golang/go/issues/15808
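
For readers unfamiliar with the term, hoisting a loop-invariant computation looks like the following when done by hand at the source level. This is a generic illustration with made-up function names, not the code from this thread, and it says nothing about what the Go compiler does with the LEAQ in question.

```go
package main

import "fmt"

// scaleNaive recomputes factor on every iteration, even though it does
// not depend on x or on the loop index.
func scaleNaive(xs []int, a, b int) int {
	total := 0
	for _, x := range xs {
		factor := a*b + 1 // loop-invariant
		total += x * factor
	}
	return total
}

// scaleHoisted computes factor once, before the loop starts.
func scaleHoisted(xs []int, a, b int) int {
	factor := a*b + 1
	total := 0
	for _, x := range xs {
		total += x * factor
	}
	return total
}

func main() {
	xs := []int{1, 2, 3, 4}
	fmt.Println(scaleNaive(xs, 2, 3), scaleHoisted(xs, 2, 3))
}
```

The two functions always return the same value; the hoisted form simply does less work per iteration.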