Hi folks,
I mentioned on the recent compiler & runtime office hours call that I'd seen an approximately 5% performance regression on the GoAWK benchmarks from 1.19 to 1.20 (thankfully PGO brings it back up to +5% above 1.19). I've now dug a bit deeper and have some real numbers, as well as the results of a git bisect.
First, run the interpreter benchmarks on both versions of Go (my plain "go" is 1.19.1, and the other version is 1.20rc1):
$ go test ./interp -bench=. -count=5 >benchmarks_1.19.txt
$ go1.20rc1 test ./interp -bench=. -count=5 >benchmarks_1.20.txt
Then use benchstat to compare them:
$ benchstat -sort=delta -geomean benchmarks_1.19.txt benchmarks_1.20.txt
name old time/op new time/op delta
BinaryOperators-16 249ns ± 3% 318ns ± 2% +27.97% (p=0.008 n=5+5)
IncrDecr-16 89.9ns ± 2% 111.2ns ± 1% +23.73% (p=0.008 n=5+5)
LocalVars-16 142ns ± 4% 172ns ± 2% +21.26% (p=0.008 n=5+5)
GlobalVars-16 141ns ± 3% 170ns ± 2% +20.24% (p=0.008 n=5+5)
ArrayOperations-16 324ns ± 4% 380ns ± 3% +17.01% (p=0.008 n=5+5)
ForLoop-16 1.41µs ± 4% 1.65µs ± 4% +16.60% (p=0.008 n=5+5)
ConcatSmall-16 202ns ± 1% 235ns ± 3% +16.49% (p=0.008 n=5+5)
ConcatTwo-16 51.3ns ± 2% 57.9ns ± 2% +12.75% (p=0.008 n=5+5)
SetField-16 464ns ± 2% 523ns ± 2% +12.66% (p=0.008 n=5+5)
BuiltinSubstr-16 459ns ± 4% 517ns ± 4% +12.59% (p=0.008 n=5+5)
CSVOutputGoAWK-16 245ns ± 2% 275ns ± 4% +12.26% (p=0.008 n=5+5)
Assign-16 142ns ± 3% 159ns ± 2% +11.84% (p=0.008 n=5+5)
FuncCall-16 1.33µs ± 2% 1.48µs ± 2% +11.73% (p=0.008 n=5+5)
CondExpr-16 85.2ns ± 2% 94.3ns ± 3% +10.78% (p=0.008 n=5+5)
Printf-16 2.46µs ± 2% 2.66µs ± 5% +8.27% (p=0.008 n=5+5)
SimpleBuiltins-16 358ns ± 0% 384ns ± 3% +7.20% (p=0.016 n=4+5)
IfStatement-16 86.3ns ± 3% 91.3ns ± 4% +5.77% (p=0.008 n=5+5)
SimplePattern-16 84.4ns ± 3% 89.2ns ± 6% +5.58% (p=0.008 n=5+5)
RecursiveFunc-16 7.62µs ± 2% 8.02µs ± 3% +5.26% (p=0.008 n=5+5)
BuiltinLength-16 281ns ± 3% 292ns ± 2% +4.00% (p=0.032 n=5+5)
Print-16 1.19µs ± 2% 1.23µs ± 1% +3.82% (p=0.008 n=5+5)
BuiltinGsubAmpersand-16 9.01µs ± 1% 9.24µs ± 1% +2.59% (p=0.008 n=5+5)
BuiltinIndex-16 652ns ± 4% 670ns ± 2% ~ (p=0.222 n=5+5)
BuiltinSub-16 7.57µs ± 1% 7.59µs ± 1% ~ (p=1.000 n=5+5)
BuiltinSubAmpersand-16 7.55µs ± 1% 7.65µs ± 2% ~ (p=0.056 n=5+5)
BuiltinGsub-16 9.06µs ± 2% 9.17µs ± 2% ~ (p=0.222 n=5+5)
BuiltinSprintf-16 5.13µs ± 1% 5.17µs ± 2% ~ (p=0.548 n=5+5)
NativeFunc-16 6.24µs ± 1% 6.25µs ± 2% ~ (p=0.841 n=5+5)
RegexMatch-16 603ns ± 3% 602ns ± 3% ~ (p=0.548 n=5+5)
ConcatLarge-16 1.32ms ± 5% 1.25ms ± 9% ~ (p=0.095 n=5+5)
Comparisons-16 224ns ± 4% 235ns ± 8% ~ (p=0.222 n=5+5)
AugAssign-16 547ns ± 3% 564ns ± 3% ~ (p=0.095 n=5+5)
RepeatNew-16 67.6ns ± 4% 69.5ns ± 3% ~ (p=0.421 n=5+5)
RepeatIONew-16 645ns ± 2% 660ns ± 2% ~ (p=0.095 n=5+5)
CSVInputGoAWK-16 243ns ± 6% 248ns ± 3% ~ (p=0.548 n=5+5)
CSVInputReader-16 164ns ± 7% 164ns ±10% ~ (p=1.000 n=5+5)
CSVOutputWriter-16 157ns ±13% 166ns ± 7% ~ (p=0.095 n=5+5)
RepeatIOExecProgram-16 21.9µs ± 1% 21.3µs ± 0% -2.79% (p=0.008 n=5+5)
BuiltinMatch-16 1.46µs ± 2% 1.42µs ± 2% -2.96% (p=0.032 n=5+5)
ForInLoop-16 1.06µs ± 2% 1.02µs ± 1% -3.79% (p=0.008 n=5+5)
RepeatExecProgram-16 9.82µs ± 4% 9.40µs ± 2% -4.22% (p=0.016 n=5+5)
BuiltinSplitRegex-16 53.4µs ± 1% 50.2µs ± 2% -6.03% (p=0.008 n=5+5)
GetField-16 198ns ± 3% 184ns ± 2% -7.40% (p=0.008 n=5+5)
BuiltinSplitSpace-16 23.4µs ± 3% 21.6µs ± 2% -7.63% (p=0.008 n=5+5)
[Geo mean] 891ns 940ns +5.50%
So overall a 5.5% increase in time taken (reduction in speed) from 1.19 to 1.20.
I won't copy the numbers here, but if I capture a CPU profile while running the benchmarks and then build with PGO, that +5.50% changes to -5.55% compared to 1.19 (so 1.20 with PGO is about 10% faster than 1.20 without PGO). So at least I can recover the speed using a PGO build.
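For reference, the PGO steps were roughly these (from memory, and the file names are illustrative; -pgo is the new build flag in Go 1.20):

$ go1.20rc1 test ./interp -bench=. -count=5 -cpuprofile=cpu.prof
$ go1.20rc1 test ./interp -bench=. -count=5 -pgo=cpu.prof >benchmarks_1.20_pgo.txt
$ benchstat -sort=delta -geomean benchmarks_1.19.txt benchmarks_1.20_pgo.txt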
I then ran a git bisect over the Go repository, using the BinaryOperators benchmark as the test at each step: if the -count=5 results were close to Go 1.19's 249ns time I typed "git bisect good", and if they were close to Go 1.20's 318ns time I typed "git bisect bad". (Because of noisy benchmarks it was a little tricky sometimes, but I think the results were fairly clear one way or the other at each step.) It ended up saying "833367e98af838a2511ee7e4e19dc8f1da7b8ed7 is the first bad commit". That's the one-line CL https://go-review.googlesource.com/c/go/+/422235: "internal/buildcfg: enable unified IR by default".
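The bisect loop itself was roughly the standard one (paths and exact steps from memory, so treat this as a sketch):

$ cd ~/src/go && git bisect start
$ git bisect bad go1.20rc1
$ git bisect good go1.19.1
$ (cd src && ./make.bash)
$ cd ~/src/goawk && ~/src/go/bin/go test ./interp -bench=BinaryOperators -count=5
$ git bisect good    # or "git bisect bad", depending on the numbers

The last three steps (rebuild the toolchain, re-run the benchmark, mark the commit) repeat at each revision git checks out.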
I don't really know what effect the unified IR had, but maybe the compiler folks will know?
I also tried a profile comparison using pprof's -diff_base option, but found the output a bit hard to understand. The commands I used were these (I believe):
$ go test ./interp -bench=BinaryOperators -count=5 -cpuprofile=bin_1.19.prof
$ go1.20rc1 test ./interp -bench=BinaryOperators -count=5 -cpuprofile=bin_1.20.prof
$ go tool pprof -diff_base bin_1.20.prof -http=:8080 bin_1.19.prof
I know the relatively recent "switch" optimization to use jump tables helped with large integer switch statements like the one in GoAWK's execute() function. Maybe the unified IR somehow caused a regression in the switch-with-jump-tables optimization?
Or maybe changes in inlining (which are presumably improvements in other contexts?) make that execute() code really big, so it doesn't play nicely with processor caches or something? That's my not-very-informed guess.
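To illustrate what I mean by a "large integer switch", the interpreter's dispatch is shaped something like the following (a simplified, made-up sketch, not GoAWK's real opcodes; the real execute() switch is much bigger). A dense switch over small integer constants like this is exactly the kind the jump-table optimization targets:

```go
package main

import "fmt"

type opcode int

const (
	opAdd opcode = iota
	opSub
	opMul
	opDiv
)

// evalBinary dispatches on the opcode, interpreter-style. With dense
// integer cases the compiler can emit a jump table instead of a chain
// of comparisons.
func evalBinary(op opcode, l, r float64) float64 {
	switch op {
	case opAdd:
		return l + r
	case opSub:
		return l - r
	case opMul:
		return l * r
	case opDiv:
		return l / r
	}
	panic("unknown opcode")
}

func main() {
	fmt.Println(evalBinary(opAdd, 2, 3)) // 5
	fmt.Println(evalBinary(opMul, 2, 3)) // 6
}
```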
GoAWK also has another set of slightly more real-world benchmarks. These don't use "go test -bench"; instead, a Python script runs various AWK scripts under several AWK implementations and times them. For better or worse, the results are displayed as speed relative to onetrueawk's time (time_of_onetrueawk / this_time), so bigger is better. Here are those results for 1.19 vs 1.20:
Test | goawk_1.19 | goawk_1.20 | awk
---------------------------- | -------- | -------- | --------
tt.01 (print) | 3.19 | 3.09 | 1.00
tt.02 (print NR NF) | 1.99 | 1.95 | 1.00
tt.02a (print length) | 1.81 | 1.80 | 1.00
tt.03 (sum length) | 1.38 | 1.40 | 1.00
tt.03a (sum field) | 1.44 | 1.42 | 1.00
tt.04 (printf fields) | 1.18 | 1.15 | 1.00
tt.05 (concat fields) | 1.63 | 1.50 | 1.00
tt.06 (count lengths) | 1.48 | 1.47 | 1.00
tt.07 (even fields) | 1.39 | 1.35 | 1.00
tt.08 (even lengths) | 2.55 | 2.44 | 1.00
tt.08z (regex simple) | 3.85 | 3.78 | 1.00
tt.09 (regex starts with) | 2.73 | 2.69 | 1.00
tt.10 (regex ends with) | 0.41 | 0.43 | 1.00
tt.10a (regex ends with var) | 0.39 | 0.42 | 1.00
tt.11 (substr) | 3.17 | 3.17 | 1.00
tt.12 (update fields) | 1.61 | 1.51 | 1.00
tt.13 (array ops) | 4.42 | 4.19 | 1.00
tt.13a (array printf) | 2.83 | 2.69 | 1.00
tt.14 (function call) | 1.62 | 1.42 | 1.00
tt.15 (format lines) | 0.89 | 0.80 | 1.00
tt.16 (count words) | 1.68 | 1.54 | 1.00
tt.big (complex program) | 2.05 | 1.97 | 1.00
tt.x1 (mandelbrot) | 2.08 | 1.69 | 1.00
tt.x2 (sum loop) | 2.05 | 1.69 | 1.00
---------------------------- | -------- | -------- | --------
**Geo mean** | **1.73** | **1.66** | **1.00**
In any case, I hope that's helpful. Let me know if something's not clear or if I can provide other numbers to help.
For reference, I'm on Linux (Ubuntu 22.04) on amd64 using a 12th Gen Intel Core i7-1260P × 16.
Cheers,
Ben