Table-driven benchmarks defeat inlining

Paul S. R. Chisholm

unread,

May 31, 2021, 6:29:27 PM5/31/21

to golan...@googlegroups.com

This is not a serious problem, but it surprised me. (By the way, how can I post a message with code formatting?)

I'd like to create table-driven benchmarks:

https://blog.golang.org/subtests.

To that end, I started with code from The Go Programming Language:

// PopCount is based on an example from chapter 2 of THE GO PROGRAMMING LANGUAGE.

// License: https://creativecommons.org/licenses/by-nc-sa/4.0/

package popcount

// pc[i] is the population count of i.

var pc [256]byte

func init() {

for i := range pc {

pc[i] = pc[i/2] + byte(i&1)

}

// PopCount returns the population count (number of set bits) of x.

func PopCount(x uint64) int {

return int(pc[byte(x>>(0*8))] +

pc[byte(x>>(1*8))] +

pc[byte(x>>(2*8))] +

pc[byte(x>>(3*8))] +

pc[byte(x>>(4*8))] +

pc[byte(x>>(5*8))] +

pc[byte(x>>(6*8))] +

pc[byte(x>>(7*8))])

}

I then wrote a simple test:

// popcount_simple_test.go

package popcount

import "testing"

func BenchmarkPopCountSimple(b *testing.B) {

sum := 0 // Avoid dead code elimination.

for i := 0; i < b.N; i++ {

sum += PopCount(0x1234567890abcdef)

}

Because I was paranoid about dead code elimination, I wrote an identical test that calls the function many times (inside the loop to call it b.N times, which should make no difference if the hypothetically-still-dead code is being eliminated):

// popcount_slow_simple_test.go

package popcount

import "testing"

func BenchmarkPopCountSlowSimple(b *testing.B) {

sum := 0 // Avoid dead code elimination.

for i := 0; i < b.N; i++ {

// Exactly as before, but call the function many times.

for j:= 0; j < 1_000_000; j++ {

sum += PopCount(0x1234567890abcdef)

}

Then I wrote an "almost simple" test that uses testing.B.Run() but is hardwired to call the function being benchmarked:

// popcount_almost_simple_test.go

package popcount

import "testing"

func BenchmarkPopCountAlmostSimple(b *testing.B) {

b.Run("BenchmarkPopCountAlmostSimple", func(b *testing.B) {

sum := 0 // Avoid dead code elimination.

for i := 0; i < b.N; i++ {

sum += PopCount(0x1234567890abcdef)

}

})

}

Finally, I wrote a "nearly-simple" test that passes arguments to Run() that could have come from a table:

// popcount_nearly_simple.go

package popcount

import "testing"

func BenchmarkPopCountNearlySimple(b *testing.B) {

f := PopCount

name := "PopCountNearlySimple"

b.Run(name, func(b *testing.B) {

sum := 0 // Avoid dead code elimination.

for i := 0; i < b.N; i++ {

sum += f(0x1234567890abcdef)

}

})

}

The simple and almost-simple results are nearly identical (happily, the slow results are not), but the nearly-simple results are an order of magnitude slower:

$ go version

go version go1.16.4 windows/amd64

$ go test -cpu=1 -bench=.

goos: windows

goarch: amd64

pkg: gopl.io/popcount

cpu: Intel(R) Core(TM) i5 CPU 760 @ 2.80GHz

BenchmarkPopCountAlmostSimple/BenchmarkPopCountAlmostSimple 1000000000 0.6102 ns/op

BenchmarkPopCountNearlySimple/PopCountNearlySimple 194925662 6.197 ns/op

BenchmarkPopCountSimple 1000000000 0.5994 ns/op

BenchmarkPopCountSlowSimple 1953 606194 ns/op

PASS

ok gopl.io/popcount 4.534s

After reading this article:

https://medium.com/a-journey-with-go/go-inlining-strategy-limitation-6b6d7fc3b1be

I ran:

$ go test -cpu=1 -bench=. -gcflags=-m 2>&1 | egrep 'inlining call to PopCount'
.\popcount_almost_simple_test.go:9:19: inlining call to PopCount
.\popcount_simple_test.go:8:18: inlining call to PopCount
.\popcount_slow_simple_test.go:10:19: inlining call to PopCount

That makes sense. 0.6 ns is less than a function call. The simple and almost-simple benchmarks can inline the call to the hardwired function, but the nearly-simple benchmark can't.

In practice, this isn't much of a problem. Any function small enough to be inlined is unlikely to be a performance bottleneck. If it ever is, a non-table-driven benchmark can still measure it.

Hope this helps. --PSRC

P.S.: Out of curiosity, how can I post a message with fancy code examples like this one?

https://groups.google.com/g/golang-nuts/c/5DgtH2Alt_I/m/hlsqdRSGAgAJ

popcount.go

popcount_almost_simple_test.go

popcount_slow_simple_test.go

popcount_simple_test.go

popcount_nearly_simple_test.go

peterGo

unread,

May 31, 2021, 9:31:06 PM5/31/21

to golang-nuts

I got as far as this

func BenchmarkPopCountSimple(b *testing.B) {
   sum := 0 // Avoid dead code elimination.
   for i := 0; i < b.N; i++ {
        sum += PopCount(0x1234567890abcdef)
   }
}

I added additional benchmarks

func BenchmarkPopCountSimpleX(b *testing.B) {
sum := 0 // Does not avoid dead code elimination.

   for i := 0; i < b.N; i++ {
        sum += PopCount(0x1234567890abcdef)
   }
}

var sum = 0 // Avoid dead code elimination.

func BenchmarkPopCountSimpleY(b *testing.B) {
sum = 0

   for i := 0; i < b.N; i++ {
        sum += PopCount(0x1234567890abcdef)
   }
}

func BenchmarkPopCountSimpleZ(b *testing.B) {

for i := 0; i < b.N; i++ {

        // Dead code elimination.
   }
}

The results

$ go version
go version devel go1.17-3b770f2ccb Sun May 30 17:47:50 2021 +0000 linux/amd64
$ go test -bench=BenchmarkPopCountSimple
BenchmarkPopCountSimple-4        1000000000             0.3634 ns/op
BenchmarkPopCountSimpleX-4      1000000000             0.3727 ns/op
BenchmarkPopCountSimpleY-4      508559316              2.295 ns/op
BenchmarkPopCountSimpleZ-4      1000000000             0.3655 ns/op
$

It doesn't look like you have avoided dead code elimination for BenchmarkPopCountSimple.

Peter

Jan Mercl

unread,

Jun 1, 2021, 3:03:20 AM6/1/21

to Paul S. R. Chisholm, golang-nuts

On Tue, Jun 1, 2021 at 12:29 AM Paul S. R. Chisholm
<psrch...@gmail.com> wrote:

> P.S.: Out of curiosity, how can I post a message with fancy code examples like this one?
> https://groups.google.com/g/golang-nuts/c/5DgtH2Alt_I/m/hlsqdRSGAgAJ

FTR, that's not fancy in my books. Code is best posted to email lists
as plain text. Or as a link to the Go playground, like
https://play.golang.org/p/x47aMnqe8ob.

Manlio Perillo

unread,

Jun 1, 2021, 6:08:23 AM6/1/21

to golang-nuts

On Tuesday, June 1, 2021 at 12:29:27 AM UTC+2 Paul S. R. Chisholm wrote:

This is not a serious problem, but it surprised me. (By the way, how can I post a message with code formatting?)

I'd like to create table-driven benchmarks:
https://blog.golang.org/subtests.

To that end, I started with code from The Go Programming Language:

> [...]

P.S.: Out of curiosity, how can I post a message with fancy code examples like this one?
https://groups.google.com/g/golang-nuts/c/5DgtH2Alt_I/m/hlsqdRSGAgAJ

You should use https://play.golang.org/, instead.

It can run main packages and test files. It also supports multiple files (see the select control on the right of share button).

The documentation of txtar (used by the playground) is in https://pkg.go.dev/golang.org/x/tools/txtar

Manlio

peterGo

unread,

Jun 1, 2021, 12:01:51 PM6/1/21

to golang-nuts

Here is a more definitive reply than my original reply.

I got as far as this

func BenchmarkPopCountSimple(b *testing.B) {
    sum := 0 // Avoid dead code elimination.
    for i := 0; i < b.N; i++ {
        sum += PopCount(0x1234567890abcdef)
    }
}

As you can see from the objdump, BenchmarkPopCountSimple dead code is eliminated.

func BenchmarkPopCountSimple(b *testing.B) {
0x4e38c0 31c9 XORL CX, CX

for i := 0; i < b.N; i++ {

0x4e38c2        eb03            JMP 0x4e38c7
0x4e38c4        48ffc1            INCQ CX
0x4e38c7        48398890010000        CMPQ CX, 0x190(AX)
0x4e38ce        7ff4            JG 0x4e38c4
}
0x4e38d0        c3            RET

I added an additional BenchmarkPopCountAlive benchmark.

var sum = 0 // Avoid dead code elimination.

func BenchmarkPopCountAlive(b *testing.B) {
sum = 0

    for i := 0; i < b.N; i++ {
        sum += PopCount(0x1234567890abcdef)
    }
}

As you can see from the objdump, BenchmarkPopCountAlive code is not eliminated.

For details, see the popcount.txt attachment.

Peter

popcount.txt

Paul S. R. Chisholm

unread,

Jun 6, 2021, 8:35:02 AM6/6/21

to golan...@googlegroups.com

Thanks! I'd seen the "dead code elimination" comment somewhere and questioned it, but not enough.

If I worry about what some future Go compiler might optimize, I end up worrying quite a lot! For example, could this code:

func BenchmarkPopCountAlive(b *testing.B) {
sum = 0

    for i := 0; i < b.N; i++ {
        sum += PopCount(0x1234567890abcdef)
    }
}

hypothetically be optimized to:

func BenchmarkPopCountAlive(b *testing.B) {
sum = PopCount(0x1234567890abcdef) * b.N

}

since PopCount() always returns the same value for the same argument? It probably wouldn't, since that would break many existing benchmarks.

Maybe more reasonably worrisome: If sum is initialized and incremented but never read, could all references to it be optimized away? That's easy enough to avoid, as is the optimization I worried about above:

var sum = 0 // Avoid dead code elimination
const firstInput uint64 = 0x1234567890abcdef

func BenchmarkPopCountSimple(b *testing.B) {
for i, input := 0, firstInput; i < b.N; i++ {
sum += PopCount(input)
input++
}
if sum == 0 {
b.Error("BenchmarkPopCountSimple: sum == 0")
}
}

Here are my new benchmark results:

$ go version
go version go1.16.5 windows/amd64

$ go test -cpu=1 -bench=.
goos: windows
goarch: amd64
pkg: gopl.io/popcount
cpu: Intel(R) Core(TM) i5 CPU 760 @ 2.80GHz

BenchmarkPopCountSimple 271004790 4.419 ns/op
BenchmarkPopCountSlowSimple 278 4235890 ns/op
BenchmarkPopCountAlmostSimple/BenchmarkPopCountAlmostSimple 272759134 4.434 ns/op
BenchmarkPopCountNearlySimple/PopCountNearlySimple 145613077 8.172 ns/op
PASS
ok gopl.io/popcount 7.061s

The anomalistically-low "simple" and "almost simple" results are now more reasonable, but still nearly a factor of two lower than the "nearly simple" results. They're still inlining the calls, as is the "slow simple" benchmark, where the "nearly simple" one isn't:

$ go test -cpu=1 -bench=. -gcflags=-m 2>&1 | egrep 'inlining call to PopCount'

.\popcount_test.go:10:18: inlining call to PopCount
.\popcount_test.go:22:19: inlining call to PopCount
.\popcount_test.go:34:19: inlining call to PopCount

Overkill, right? --PSRC

P.S.: For better or worse, I discovered I can -- but shouldn't, based on the feedback I got! -- get "fancy" formatting by copying from vscode and pasting into Gmail.

On Tuesday, June 1, 2021 at 12:01:51 PM UTC-4 peterGo wrote:

Here is a more definitive reply than my original reply.

I got as far as this

func BenchmarkPopCountSimple(b *testing.B) {
    sum := 0 // Avoid dead code elimination.
    for i := 0; i < b.N; i++ {
        sum += PopCount(0x1234567890abcdef)
    }
}

As you can see from the objdump, BenchmarkPopCountSimple dead code is eliminated.

func BenchmarkPopCountSimple(b *testing.B) {
0x4e38c0 31c9 XORL CX, CX

for i := 0; i < b.N; i++ {

0x4e38c2        eb03            JMP 0x4e38c7
0x4e38c4        48ffc1            INCQ CX
0x4e38c7        48398890010000        CMPQ CX, 0x190(AX)
0x4e38ce        7ff4            JG 0x4e38c4
}
0x4e38d0        c3            RET

I added an additional BenchmarkPopCountAlive benchmark.

var sum = 0 // Avoid dead code elimination.

func BenchmarkPopCountAlive(b *testing.B) {
    sum = 0

    for i := 0; i < b.N; i++ {
        sum += PopCount(0x1234567890abcdef)
    }
}

As you can see from the objdump, BenchmarkPopCountAlive code is not eliminated.

For details, see the popcount.txt attachment.

Peter

On Monday, May 31, 2021 at 6:29:27 PM UTC-4 Paul S. R. Chisholm wrote:

popcount_test.go

Ian Davis

unread,

Jun 7, 2021, 6:14:05 AM6/7/21

to golan...@googlegroups.com

Would it be feasible for the Go tool to disable inlining and deadcode elimination of code within the bodies of Benchmarks and Tests? Not the code under test of course. It could be as crude as disabling these optimizations for files in _test.go files.

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/CAG%3D3cjnebauo-VDiX%2BQjPVnnSji1DA0DtxsTKkOcTW9eVfTN8g%40mail.gmail.com.

Attachments:
popcount_test.go

peterGo

unread,

Jun 7, 2021, 11:47:20 AM6/7/21

to golang-nuts

Ian,

I don't know whether it is feasible. It is unnecessary.

A benchmark should be treated as a scientific experiment. If we do that with Paul's benchmarks and write them in scientific form, then we get the expected results.

$ benchstat xpt.txt
name                                                  time/op
PopCountSimple-4                                      5.71ns ± 0%
PopCountSlowSimple-4                                  5.70ms ± 0%
PopCountAlmostSimple/BenchmarkPopCountAlmostSimple-4 5.70ns ± 0%
PopCountNearlySimple/PopCountNearlySimple-4           5.69ns ± 0%

$

Peter

popcount_test.go

Ian Davis

unread,

Jun 7, 2021, 12:05:52 PM6/7/21

to golan...@googlegroups.com

Go does a good job of making it easy to do the right thing, especially in tests (such as parallel benchmarks or setting env variables), and avoiding the need for tricks like package level variables or noinline directives seems a useful feature.

To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/ce3ef331-4156-4b29-97f8-c73456a7b6f7n%40googlegroups.com.

Attachments:
popcount_test.go

drc...@google.com

unread,

Jun 8, 2021, 2:52:36 PM6/8/21

to golang-nuts

On Sunday, June 6, 2021 at 8:35:02 AM UTC-4 Paul S. R. Chisholm wrote:

For example, could this code:

func BenchmarkPopCountAlive(b *testing.B) {
    sum = 0
    for i := 0; i < b.N; i++ {
        sum += PopCount(0x1234567890abcdef)
    }
}

hypothetically be optimized to:

func BenchmarkPopCountAlive(b *testing.B) {
    sum = PopCount(0x1234567890abcdef) * b.N
}

since PopCount() always returns the same value for the same argument? It probably wouldn't, since that would break many existing benchmarks.

We're talking about adding that very optimization, and I estimate the chance of it appearing in the next 2 years is above 50%.

Preserving benchmark performance is an anti-goal for people working on general-purpose (*) compilers and runtimes.

(*) crypto and real-time are special-purpose.

As to "why", I know of at least three reasons:

- hoisting invariant expressions out of loops is one of those mechanical things that computers do pretty well, and Go's package layering makes it easy to gather "pure function" information.

- we're getting more and more Go programmers, doing more and more diverse things, they won't all be interested in learning hand-optimization tricks.

- when we turn on generics, and as more things are written generically, generic-expansion will remove opportunities to hand-optimize code in the usual way

So we are *definitely* thinking about this.

As to what to do for your benchmarks, I am not sure. One obvious answer is "write them more carefully", but this does somewhat subvert the idea that it's easy to write benchmarks in Go. We might also provide more guidance, as in "a benchmark must do useful work that is consumed or checked" and then provide examples of the same. It's not just the Go compiler, either -- if a loop-invariant input causes code to take exactly the same paths over and over again, hardware branch predictors will do a better-than-usual job.

Reply all

Reply to author

Forward