Over the weekend, I was debugging an unexpectedly slow Go program and
the profiler fingered a function that was doing nothing but calling
sort.Search as in the documentation.
func find(data []int, x int) int {
return sort.Search(len(data), func(i int) bool { return data[i] >= x })
}
The profile showed a lot of time spent in runtime.closure, and
runtime.mallocgc. As far as I can tell, that dominated the actual
time spent searching the array. I replaced the binary search with a
simple linear search and knocked a factor of 5 off of the runtime.
Could I have written the call to sort.Search in a way that didn't pay
a large penalty for creating a closure and garbage-collecting it
later? Is this something the compiler might be able to optimize?
I wrote a short benchmark to demonstrate the problem:
package search_test
import (
"sort"
"testing"
)
func benchmarkBinarySearch(b *testing.B, size int) {
b.StopTimer()
data := make([]int, size)
for i := 0; i < len(data); i++ {
data[i] = i
}
b.StartTimer()
for i := 0; i < b.N; i++ {
for targ := 0; targ <= size; targ += size/2 {
j := sort.Search(size, func(j int) bool { return data[j] >= targ })
if j != targ {
b.Errorf("sort.Search(%d, %d) = %d", size, targ, j)
}
}
}
}
func BenchmarkBinarySearch128(b *testing.B) { benchmarkBinarySearch(b, 1<<7) }
func BenchmarkBinarySearch256(b *testing.B) { benchmarkBinarySearch(b, 1<<8) }
func BenchmarkBinarySearch512(b *testing.B) { benchmarkBinarySearch(b, 1<<9) }
func BenchmarkBinarySearch1K(b *testing.B) { benchmarkBinarySearch(b, 1<<10) }
func linearSearch(data []int, targ int) int {
for i := range data {
if data[i] >= targ {
return i
}
}
return len(data)
}
func benchmarkLinearSearch(b *testing.B, size int) {
b.StopTimer()
data := make([]int, size)
for i := 0; i < len(data); i++ {
data[i] = i
}
b.StartTimer()
for i := 0; i < b.N; i++ {
for targ := 0; targ <= size; targ += size/2 {
j := linearSearch(data, targ)
if j != targ {
b.Errorf("linearSearch(%d, %d) = %d", size, targ, j)
}
}
}
}
func BenchmarkLinearSearch128(b *testing.B) { benchmarkLinearSearch(b, 1<<7) }
func BenchmarkLinearSearch256(b *testing.B) { benchmarkLinearSearch(b, 1<<8) }
func BenchmarkLinearSearch512(b *testing.B) { benchmarkLinearSearch(b, 1<<9) }
func BenchmarkLinearSearch1K(b *testing.B) { benchmarkLinearSearch(b, 1<<10) }
Results:
BenchmarkBinarySearch128 2000000 942 ns/op
BenchmarkBinarySearch256 2000000 975 ns/op
BenchmarkBinarySearch512 2000000 979 ns/op
BenchmarkBinarySearch1K 1000000 1001 ns/op
BenchmarkLinearSearch128 10000000 192 ns/op
BenchmarkLinearSearch256 5000000 347 ns/op
BenchmarkLinearSearch512 5000000 656 ns/op
BenchmarkLinearSearch1K 1000000 1276 ns/op
The break-even point is around 1,000.
A profile from running just the binary search benchmarks:
(pprof) top10
Total: 899 samples
225 25.0% 25.0% 225 25.0%
_/home/dadkins/sandbox/search_test._func_001
147 16.4% 41.4% 372 41.4% sort.Search
81 9.0% 50.4% 487 54.2% runtime.closure
64 7.1% 57.5% 141 15.7% runtime.MCache_Alloc
64 7.1% 64.6% 64 7.1% scanblock
36 4.0% 68.6% 370 41.2% runtime.mallocgc
35 3.9% 72.5% 88 9.8% sweepspan
30 3.3% 75.9% 30 3.3% runtime.memmove
28 3.1% 79.0% 31 3.4% MCentral_Alloc
27 3.0% 82.0% 880 97.9%
_/home/dadkins/sandbox/search_test.benchmarkBinarySearch
-Dan
BenchmarkBinarySearch128 10000000 186 ns/op
BenchmarkBinarySearch256 10000000 211 ns/op
BenchmarkBinarySearch512 10000000 233 ns/op
BenchmarkBinarySearch1K 10000000 265 ns/op
BenchmarkLinearSearch128 10000000 228 ns/op
BenchmarkLinearSearch256 5000000 397 ns/op
BenchmarkLinearSearch512 5000000 741 ns/op
BenchmarkLinearSearch1K 1000000 1426 ns/op
I understand that the closure is the costly part; that's what the
distilled benchmark was intended to show. The question is, how can I
write my original function to avoid that (without resorting to a
global)? Or, I guess, why isn't the straightforward code with the
O(log n) algorithm fast enough?
func find(data []int, x int) int {
return sort.Search(len(data), func(i int) bool { return data[i] >= x })
}
-Dan
The question is, how can I
write my original function to avoid that (without resorting to a
global)?
sort.Search(len(data), func(i int) bool { return data[i] >= x })
would recognize that sort.Search did not retain the
closure, so the closure did not need to be allocated
(and thus not garbage collected), but there was not
enough time to do everything I had hoped for.
This is still something we can do for Go 1.1.
Russ