I can reproduce similar behavior on linux-amd64:
$ perf stat ./example.com.test -test.bench=BenchmarkInline -test.benchtime=100000000x
goos: linux
goarch: amd64
pkg: example.com
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
BenchmarkInline-12 100000000 16.78 ns/op
PASS
Performance counter stats for './example.com.test -test.bench=BenchmarkInline -test.benchtime=100000000x':
1,691.95 msec task-clock:u # 1.004 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
352 page-faults:u # 208.044 /sec
6,732,752,072 cycles:u # 3.979 GHz
22,405,823,428 instructions:u # 3.33 insn per cycle
6,501,294,164 branches:u # 3.842 G/sec
149,596 branch-misses:u # 0.00% of all branches
1.684677260 seconds time elapsed
1.692474000 seconds user
0.004020000 seconds sys
$ perf stat ./example.com.test -test.bench=BenchmarkNoInline -test.benchtime=100000000x
goos: linux
goarch: amd64
pkg: example.com
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
BenchmarkNoInline-12 100000000 10.79 ns/op
PASS
Performance counter stats for './example.com.test -test.bench=BenchmarkNoInline -test.benchtime=100000000x':
1,091.71 msec task-clock:u # 1.005 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
363 page-faults:u # 332.505 /sec
4,490,159,750 cycles:u # 4.113 GHz
20,205,764,499 instructions:u # 4.50 insn per cycle
6,701,281,015 branches:u # 6.138 G/sec
586,073 branch-misses:u # 0.01% of all branches
1.086302272 seconds time elapsed
1.087710000 seconds user
0.008027000 seconds sys
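The non-inlined version actually executes fewer instructions for the same benchmark (about 20.2 billion versus 22.4 billion) and retires them at a higher rate (4.50 versus 3.33 instructions per cycle). That surprises me, because from a naive look at the disassembly the inlined version seems much more compact.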
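To make the gap easier to compare, here is a rough sketch that normalizes the counters above to per-iteration figures. The `counters` type and `perOp` helper are names made up for this illustration, and the totals cover the whole process (test harness startup included), so the per-op values are only approximate.

```go
package main

import "fmt"

// counters holds the totals perf stat reported for one benchmark run.
// (Hypothetical type, introduced only for this illustration.)
type counters struct {
	name         string
	cycles       uint64
	instructions uint64
}

// perOp divides a whole-process counter total by the iteration count.
// The totals include harness overhead, so the result is an estimate.
func perOp(total, iterations uint64) float64 {
	return float64(total) / float64(iterations)
}

func main() {
	const iterations = 100000000 // from -test.benchtime=100000000x

	// Values copied from the cycles:u and instructions:u lines above.
	runs := []counters{
		{"BenchmarkInline", 6732752072, 22405823428},
		{"BenchmarkNoInline", 4490159750, 20205764499},
	}
	for _, r := range runs {
		fmt.Printf("%-18s %6.1f cycles/op %6.1f insn/op %.2f insn/cycle\n",
			r.name,
			perOp(r.cycles, iterations),
			perOp(r.instructions, iterations),
			float64(r.instructions)/float64(r.cycles))
	}
}
```

This works out to roughly 224 instructions and 67 cycles per iteration for the inlined version, against roughly 202 instructions and 45 cycles for the non-inlined one, so the inlined build both executes more instructions and gets less work done per cycle.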