Performance regression from 1.19 to 1.20 on GoAWK benchmarks


ben...@gmail.com

Dec 21, 2022, 2:33:29 AM
to golang-dev
Hi folks,

I mentioned on the recent compiler & runtime office hours call that I'd seen an approximately 5% performance regression on the GoAWK benchmarks from 1.19 to 1.20 (thankfully PGO brings it back up to about 5% faster than 1.19). I've now dug a bit deeper and have some real numbers, as well as the results of a git bisect.

First, the micro-benchmark numbers. You should be able to reproduce these by cloning https://github.com/benhoyt/goawk and running the following commands.

First run the interpreter benchmarks on both versions of Go (my "go" is 1.19.1, and the other version is 1.20rc1):

$ go test ./interp -bench=. -count=5 >benchmarks_1.19.txt
$ go1.20rc1 test ./interp -bench=. -count=5 >benchmarks_1.20.txt

Then use benchstat to compare them:

$ benchstat -sort=delta -geomean benchmarks_1.19.txt benchmarks_1.20.txt
name                     old time/op  new time/op   delta
BinaryOperators-16        249ns ± 3%    318ns ± 2%  +27.97%  (p=0.008 n=5+5)
IncrDecr-16              89.9ns ± 2%  111.2ns ± 1%  +23.73%  (p=0.008 n=5+5)
LocalVars-16              142ns ± 4%    172ns ± 2%  +21.26%  (p=0.008 n=5+5)
GlobalVars-16             141ns ± 3%    170ns ± 2%  +20.24%  (p=0.008 n=5+5)
ArrayOperations-16        324ns ± 4%    380ns ± 3%  +17.01%  (p=0.008 n=5+5)
ForLoop-16               1.41µs ± 4%   1.65µs ± 4%  +16.60%  (p=0.008 n=5+5)
ConcatSmall-16            202ns ± 1%    235ns ± 3%  +16.49%  (p=0.008 n=5+5)
ConcatTwo-16             51.3ns ± 2%   57.9ns ± 2%  +12.75%  (p=0.008 n=5+5)
SetField-16               464ns ± 2%    523ns ± 2%  +12.66%  (p=0.008 n=5+5)
BuiltinSubstr-16          459ns ± 4%    517ns ± 4%  +12.59%  (p=0.008 n=5+5)
CSVOutputGoAWK-16         245ns ± 2%    275ns ± 4%  +12.26%  (p=0.008 n=5+5)
Assign-16                 142ns ± 3%    159ns ± 2%  +11.84%  (p=0.008 n=5+5)
FuncCall-16              1.33µs ± 2%   1.48µs ± 2%  +11.73%  (p=0.008 n=5+5)
CondExpr-16              85.2ns ± 2%   94.3ns ± 3%  +10.78%  (p=0.008 n=5+5)
Printf-16                2.46µs ± 2%   2.66µs ± 5%   +8.27%  (p=0.008 n=5+5)
SimpleBuiltins-16         358ns ± 0%    384ns ± 3%   +7.20%  (p=0.016 n=4+5)
IfStatement-16           86.3ns ± 3%   91.3ns ± 4%   +5.77%  (p=0.008 n=5+5)
SimplePattern-16         84.4ns ± 3%   89.2ns ± 6%   +5.58%  (p=0.008 n=5+5)
RecursiveFunc-16         7.62µs ± 2%   8.02µs ± 3%   +5.26%  (p=0.008 n=5+5)
BuiltinLength-16          281ns ± 3%    292ns ± 2%   +4.00%  (p=0.032 n=5+5)
Print-16                 1.19µs ± 2%   1.23µs ± 1%   +3.82%  (p=0.008 n=5+5)
BuiltinGsubAmpersand-16  9.01µs ± 1%   9.24µs ± 1%   +2.59%  (p=0.008 n=5+5)
BuiltinIndex-16           652ns ± 4%    670ns ± 2%     ~     (p=0.222 n=5+5)
BuiltinSub-16            7.57µs ± 1%   7.59µs ± 1%     ~     (p=1.000 n=5+5)
BuiltinSubAmpersand-16   7.55µs ± 1%   7.65µs ± 2%     ~     (p=0.056 n=5+5)
BuiltinGsub-16           9.06µs ± 2%   9.17µs ± 2%     ~     (p=0.222 n=5+5)
BuiltinSprintf-16        5.13µs ± 1%   5.17µs ± 2%     ~     (p=0.548 n=5+5)
NativeFunc-16            6.24µs ± 1%   6.25µs ± 2%     ~     (p=0.841 n=5+5)
RegexMatch-16             603ns ± 3%    602ns ± 3%     ~     (p=0.548 n=5+5)
ConcatLarge-16           1.32ms ± 5%   1.25ms ± 9%     ~     (p=0.095 n=5+5)
Comparisons-16            224ns ± 4%    235ns ± 8%     ~     (p=0.222 n=5+5)
AugAssign-16              547ns ± 3%    564ns ± 3%     ~     (p=0.095 n=5+5)
RepeatNew-16             67.6ns ± 4%   69.5ns ± 3%     ~     (p=0.421 n=5+5)
RepeatIONew-16            645ns ± 2%    660ns ± 2%     ~     (p=0.095 n=5+5)
CSVInputGoAWK-16          243ns ± 6%    248ns ± 3%     ~     (p=0.548 n=5+5)
CSVInputReader-16         164ns ± 7%    164ns ±10%     ~     (p=1.000 n=5+5)
CSVOutputWriter-16        157ns ±13%    166ns ± 7%     ~     (p=0.095 n=5+5)
RepeatIOExecProgram-16   21.9µs ± 1%   21.3µs ± 0%   -2.79%  (p=0.008 n=5+5)
BuiltinMatch-16          1.46µs ± 2%   1.42µs ± 2%   -2.96%  (p=0.032 n=5+5)
ForInLoop-16             1.06µs ± 2%   1.02µs ± 1%   -3.79%  (p=0.008 n=5+5)
RepeatExecProgram-16     9.82µs ± 4%   9.40µs ± 2%   -4.22%  (p=0.016 n=5+5)
BuiltinSplitRegex-16     53.4µs ± 1%   50.2µs ± 2%   -6.03%  (p=0.008 n=5+5)
GetField-16               198ns ± 3%    184ns ± 2%   -7.40%  (p=0.008 n=5+5)
BuiltinSplitSpace-16     23.4µs ± 3%   21.6µs ± 2%   -7.63%  (p=0.008 n=5+5)
[Geo mean]                891ns         940ns        +5.50%

So overall a 5.5% increase in time taken (reduction in speed) from 1.19 to 1.20.

I won't copy the numbers here, but if I capture a CPU profile while running the benchmarks and then build with PGO, that +5.50% becomes -5.55% relative to 1.19 (so 1.20 with PGO is about 10% faster than 1.20 without PGO). So at least I can recover the speed with a PGO build.

I then did a git bisect from commit 78fc810 (the latest on master) to e99f53f (the commit from which the 1.19 branch was taken), and ran the most-regressed benchmark, the BinaryOperators one, each time. Here's the benchmark code for that: https://github.com/benhoyt/goawk/blob/d7911647f3b08af099b50c9e991e72cfacdd1e2e/interp/interp_test.go#L2249

If the -count=5 results were close to Go 1.19's 249ns time I typed "git bisect good", if they were close to Go 1.20's 318ns time I typed "git bisect bad", and it ended up saying "833367e98af838a2511ee7e4e19dc8f1da7b8ed7 is the first bad commit". That's the one-liner CL https://go-review.googlesource.com/c/go/+/422235: "internal/buildcfg: enable unified IR by default".

(Because of noisy benchmarks it was a little tricky sometimes, but I think the results were fairly clear one way or the other at each step.)

I don't really know what effect the unified IR had, but maybe the compiler folks will?

I also tried a profile comparison using pprof's -diff_base option, but found the output a bit hard to understand.

The commands I used for the profile comparison were these (I believe):

$ go test ./interp -bench=BinaryOperators -count=5 -cpuprofile=bin_1.19.prof
$ go1.20rc1 test ./interp -bench=BinaryOperators -count=5 -cpuprofile=bin_1.20.prof
$ go tool pprof -diff_base bin_1.20.prof -http=:8080 bin_1.19.prof

It *looks* like the main increase was in the big VM loop, which is basically a big switch statement that switches on the GoAWK virtual machine opcode: https://github.com/benhoyt/goawk/blob/d7911647f3b08af099b50c9e991e72cfacdd1e2e/interp/vm.go#L30

I know the relatively recent "switch" optimization to use jump tables helped with large integer switch statements like this. Maybe the unified IR somehow caused a regression on the switch-with-jump-tables optimization?

Or maybe changes in inlining (which are presumably improvements in other contexts?) cause that execute() code to become really big and not play nicely with processor caches or something? That's just my not-very-informed guess, though.

GoAWK also has a set of slightly more real-world benchmarks -- these don't use "go test -bench", but instead use a Python script to run and time various AWK scripts across AWK implementations. For better or worse, the results are displayed as speed relative to onetrueawk (time_of_onetrueawk / this_time), so bigger is better. Here are those results for 1.19 vs 1.20:

Test                         | goawk_1.19 | goawk_1.20 |      awk
---------------------------- | -------- | -------- | --------
tt.01 (print)                |     3.19 |     3.09 |     1.00
tt.02 (print NR NF)          |     1.99 |     1.95 |     1.00
tt.02a (print length)        |     1.81 |     1.80 |     1.00
tt.03 (sum length)           |     1.38 |     1.40 |     1.00
tt.03a (sum field)           |     1.44 |     1.42 |     1.00
tt.04 (printf fields)        |     1.18 |     1.15 |     1.00
tt.05 (concat fields)        |     1.63 |     1.50 |     1.00
tt.06 (count lengths)        |     1.48 |     1.47 |     1.00
tt.07 (even fields)          |     1.39 |     1.35 |     1.00
tt.08 (even lengths)         |     2.55 |     2.44 |     1.00
tt.08z (regex simple)        |     3.85 |     3.78 |     1.00
tt.09 (regex starts with)    |     2.73 |     2.69 |     1.00
tt.10 (regex ends with)      |     0.41 |     0.43 |     1.00
tt.10a (regex ends with var) |     0.39 |     0.42 |     1.00
tt.11 (substr)               |     3.17 |     3.17 |     1.00
tt.12 (update fields)        |     1.61 |     1.51 |     1.00
tt.13 (array ops)            |     4.42 |     4.19 |     1.00
tt.13a (array printf)        |     2.83 |     2.69 |     1.00
tt.14 (function call)        |     1.62 |     1.42 |     1.00
tt.15 (format lines)         |     0.89 |     0.80 |     1.00
tt.16 (count words)          |     1.68 |     1.54 |     1.00
tt.big (complex program)     |     2.05 |     1.97 |     1.00
tt.x1 (mandelbrot)           |     2.08 |     1.69 |     1.00
tt.x2 (sum loop)             |     2.05 |     1.69 |     1.00
---------------------------- | -------- | -------- | --------
**Geo mean**                 | **1.73** | **1.66** | **1.00**

In any case, I hope that's helpful. Let me know if something's not clear or if I can provide other numbers to help.

For reference, I'm on Linux (Ubuntu 22.04) on amd64 using a 12th Gen Intel Core i7-1260P × 16.

Cheers,
Ben

Michael Pratt

Dec 21, 2022, 11:34:36 AM
to ben...@gmail.com, golang-dev, Matthew Dempsky
Hi Ben,

Thanks for the detailed analysis! Could you try building with 1.20rc1 and GOEXPERIMENT=nounified set? That should be conceptually the same as reverting CL 422235, so if that fixes the regression then it is good confirmation of the bisect result.

--
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-dev/33e02a77-b872-40b8-b8d9-4cd5ac25d972n%40googlegroups.com.

Ben Hoyt

Dec 21, 2022, 3:43:43 PM
to Michael Pratt, golang-dev, Matthew Dempsky
Thanks -- yes, it looks like that has the same effect! For the BinaryOperators benchmark, it's around 340ns/op normally (with unified IR), and 240ns/op with "nounified" set. Results below:

$ go1.20rc1 test ./interp -bench=BinaryOperators -count=5
goos: linux
goarch: amd64
pkg: github.com/benhoyt/goawk/interp
cpu: 12th Gen Intel(R) Core(TM) i7-1260P
BenchmarkBinaryOperators-16      3337188               353.7 ns/op
BenchmarkBinaryOperators-16      3801718               346.9 ns/op
BenchmarkBinaryOperators-16      4011024               339.9 ns/op
BenchmarkBinaryOperators-16      3440431               312.1 ns/op
BenchmarkBinaryOperators-16      3969404               344.1 ns/op
PASS
ok      github.com/benhoyt/goawk/interp 12.621s

$ GOEXPERIMENT=nounified go1.20rc1 test ./interp -bench=BinaryOperators -count=5
goos: linux
goarch: amd64
pkg: github.com/benhoyt/goawk/interp
cpu: 12th Gen Intel(R) Core(TM) i7-1260P
BenchmarkBinaryOperators-16      5058613               235.9 ns/op
BenchmarkBinaryOperators-16      4932649               244.1 ns/op
BenchmarkBinaryOperators-16      5177004               236.5 ns/op
BenchmarkBinaryOperators-16      5037283               235.2 ns/op
BenchmarkBinaryOperators-16      5116255               261.9 ns/op
PASS
ok      github.com/benhoyt/goawk/interp 12.035s

When I run all the microbenchmarks with "nounified" set, 1.20 is definitely better than before, though still 2% slower than 1.19 (geo mean) ... different benchmarks are at the top, though:

$ benchstat -sort=delta -geomean benchmarks_1.19.txt benchmarks_1.20_nounified.txt

name                     old time/op  new time/op  delta
ConcatLarge-16           1.32ms ± 5%  1.46ms ± 6%  +10.25%  (p=0.016 n=5+5)
Printf-16                2.46µs ± 2%  2.70µs ± 2%  +10.00%  (p=0.008 n=5+5)
SimplePattern-16         84.4ns ± 3%  92.2ns ± 5%   +9.23%  (p=0.008 n=5+5)
SimpleBuiltins-16         358ns ± 0%   389ns ± 7%   +8.60%  (p=0.016 n=4+5)
ForLoop-16               1.41µs ± 4%  1.51µs ± 2%   +7.09%  (p=0.008 n=5+5)
SetField-16               464ns ± 2%   494ns ± 1%   +6.50%  (p=0.008 n=5+5)
ConcatTwo-16             51.3ns ± 2%  54.6ns ± 4%   +6.48%  (p=0.008 n=5+5)
Print-16                 1.19µs ± 2%  1.26µs ± 2%   +5.89%  (p=0.008 n=5+5)
ConcatSmall-16            202ns ± 1%   213ns ± 4%   +5.49%  (p=0.008 n=5+5)
FuncCall-16              1.33µs ± 2%  1.38µs ± 4%   +4.21%  (p=0.016 n=5+5)
RepeatIOExecProgram-16   21.9µs ± 1%  22.6µs ± 3%   +3.34%  (p=0.032 n=5+5)
BuiltinSprintf-16        5.13µs ± 1%  5.26µs ± 2%   +2.59%  (p=0.016 n=5+5)
BuiltinGsubAmpersand-16  9.01µs ± 1%  9.17µs ± 2%   +1.83%  (p=0.040 n=5+5)
NativeFunc-16            6.24µs ± 1%  6.33µs ± 1%   +1.35%  (p=0.016 n=5+5)
GlobalVars-16             141ns ± 3%   147ns ± 6%     ~     (p=0.206 n=5+5)
LocalVars-16              142ns ± 4%   149ns ± 7%     ~     (p=0.056 n=5+5)
IncrDecr-16              89.9ns ± 2%  93.7ns ± 8%     ~     (p=0.151 n=5+5)
BuiltinMatch-16          1.46µs ± 2%  1.48µs ± 1%     ~     (p=0.175 n=5+5)
BuiltinLength-16          281ns ± 3%   274ns ± 8%     ~     (p=1.000 n=5+5)
BuiltinIndex-16           652ns ± 4%   625ns ± 7%     ~     (p=0.548 n=5+5)
BuiltinSubstr-16          459ns ± 4%   455ns ± 7%     ~     (p=0.690 n=5+5)
BuiltinSub-16            7.57µs ± 1%  7.51µs ± 1%     ~     (p=0.310 n=5+5)
BuiltinSubAmpersand-16   7.55µs ± 1%  7.52µs ± 2%     ~     (p=0.548 n=5+5)
BuiltinGsub-16           9.06µs ± 2%  9.17µs ± 1%     ~     (p=0.151 n=5+5)
RecursiveFunc-16         7.62µs ± 2%  7.43µs ± 4%     ~     (p=0.095 n=5+5)
ForInLoop-16             1.06µs ± 2%  1.07µs ± 2%     ~     (p=0.198 n=5+5)
IfStatement-16           86.3ns ± 3%  84.0ns ± 7%     ~     (p=0.690 n=5+5)
CondExpr-16              85.2ns ± 2%  85.0ns ± 7%     ~     (p=0.690 n=5+5)
GetField-16               198ns ± 3%   199ns ± 9%     ~     (p=0.690 n=5+5)
RegexMatch-16             603ns ± 3%   603ns ± 8%     ~     (p=0.690 n=5+5)
BinaryOperators-16        249ns ± 3%   248ns ± 7%     ~     (p=0.690 n=5+5)
Comparisons-16            224ns ± 4%   219ns ± 8%     ~     (p=1.000 n=5+5)
ArrayOperations-16        324ns ± 4%   322ns ± 7%     ~     (p=0.690 n=5+5)
Assign-16                 142ns ± 3%   147ns ± 6%     ~     (p=0.421 n=5+5)
AugAssign-16              547ns ± 3%   546ns ± 6%     ~     (p=0.690 n=5+5)
RepeatExecProgram-16     9.82µs ± 4%  9.76µs ± 2%     ~     (p=1.000 n=5+5)
RepeatNew-16             67.6ns ± 4%  72.0ns ± 9%     ~     (p=0.151 n=5+5)
RepeatIONew-16            645ns ± 2%   661ns ± 3%     ~     (p=0.095 n=5+5)
CSVInputGoAWK-16          243ns ± 6%   238ns ± 6%     ~     (p=0.310 n=5+5)
CSVInputReader-16         164ns ± 7%   158ns ±20%     ~     (p=0.310 n=5+5)
CSVOutputGoAWK-16         245ns ± 2%   258ns ± 6%     ~     (p=0.095 n=5+5)
CSVOutputWriter-16        157ns ±13%   167ns ± 6%     ~     (p=0.151 n=5+5)
BuiltinSplitRegex-16     53.4µs ± 1%  51.0µs ± 1%   -4.54%  (p=0.008 n=5+5)
BuiltinSplitSpace-16     23.4µs ± 3%  21.8µs ± 2%   -7.08%  (p=0.008 n=5+5)
[Geo mean]                891ns        908ns        +1.92%

-Ben

cuong.m...@gmail.com

Dec 22, 2022, 3:54:13 PM
to golang-dev
Hi Ben,

I think this is the same issue with https://github.com/golang/go/issues/57434.

Would you mind applying https://go-review.googlesource.com/c/go/+/459295 and checking whether it fixes the benchmark? Thank you!

Best,
Cuong

ben...@gmail.com

Dec 22, 2022, 4:22:29 PM
to golang-dev
It seems that CL doesn't make a difference compared to 1.20 master. I just tried it with "gotip download 459295", but the results are the same as with plain "gotip download" (master), and still ~23% slower than 1.19. Here are the results:

# Version 1.19.1: best 229 ns/op
$ go version
go version go1.19.1 linux/amd64
$ go test -run=^$ -bench=BinaryOperators -count=5 -benchmem ./interp

goos: linux
goarch: amd64
pkg: github.com/benhoyt/goawk/interp
cpu: 12th Gen Intel(R) Core(TM) i7-1260P
BenchmarkBinaryOperators-16         4826552           255.0 ns/op           0 B/op           0 allocs/op
BenchmarkBinaryOperators-16         5005927           229.1 ns/op           0 B/op           0 allocs/op
BenchmarkBinaryOperators-16         5075928           230.7 ns/op           0 B/op           0 allocs/op
BenchmarkBinaryOperators-16         4612962           256.4 ns/op           0 B/op           0 allocs/op
BenchmarkBinaryOperators-16         5148254           237.7 ns/op           0 B/op           0 allocs/op
PASS
ok      github.com/benhoyt/goawk/interp    7.199s

# Go 1.20 latest master: best 282 ns/op
$ gotip download
...
$ gotip version
go version devel go1.20-c61d322 Thu Dec 22 18:04:06 2022 +0000 linux/amd64
$ gotip test -run=^$ -bench=BinaryOperators -count=5 -benchmem ./interp

goos: linux
goarch: amd64
pkg: github.com/benhoyt/goawk/interp
cpu: 12th Gen Intel(R) Core(TM) i7-1260P
BenchmarkBinaryOperators-16         3946436           303.4 ns/op           0 B/op           0 allocs/op
BenchmarkBinaryOperators-16         4391176           284.3 ns/op           0 B/op           0 allocs/op
BenchmarkBinaryOperators-16         4206294           282.3 ns/op           0 B/op           0 allocs/op
BenchmarkBinaryOperators-16         4222044           289.6 ns/op           0 B/op           0 allocs/op
BenchmarkBinaryOperators-16         4173250           312.5 ns/op           0 B/op           0 allocs/op
PASS
ok      github.com/benhoyt/goawk/interp    7.628s

# Go 1.20 with CL 459295 applied: best 281 ns/op (same as 1.20 master)
$ gotip download 459295
...
$ gotip version
go version devel go1.20-b6ca61f Fri Dec 23 03:59:37 2022 +0700 linux/amd64
$ gotip test -run=^$ -bench=BinaryOperators -count=5 -benchmem ./interp

goos: linux
goarch: amd64
pkg: github.com/benhoyt/goawk/interp
cpu: 12th Gen Intel(R) Core(TM) i7-1260P
BenchmarkBinaryOperators-16         3799864           288.1 ns/op           0 B/op           0 allocs/op
BenchmarkBinaryOperators-16         4110276           282.6 ns/op           0 B/op           0 allocs/op
BenchmarkBinaryOperators-16         4239718           313.8 ns/op           0 B/op           0 allocs/op
BenchmarkBinaryOperators-16         3761960           312.8 ns/op           0 B/op           0 allocs/op
BenchmarkBinaryOperators-16         3823831           281.1 ns/op           0 B/op           0 allocs/op
PASS
ok      github.com/benhoyt/goawk/interp    7.391s

-Ben

cuong.m...@gmail.com

Dec 22, 2022, 4:30:19 PM
to golang-dev
Hi Ben,

Would you mind applying the latest patch set 4? I messed up my local CL and wasn't applying the right fix.

Best,
Cuong

Ben Hoyt

Dec 22, 2022, 4:40:54 PM
to cuong.m...@gmail.com, golang-dev
Just tried patch set 4, and unfortunately no change for this issue. Still around the ~290ns/op mark:

$ gotip version
go version devel go1.20-b5159f8 Fri Dec 23 04:27:00 2022 +0700 linux/amd64
$ gotip test -run=^$ -bench=BinaryOperators -count=5 -benchmem ./interp
goos: linux
goarch: amd64
pkg: github.com/benhoyt/goawk/interp
cpu: 12th Gen Intel(R) Core(TM) i7-1260P
BenchmarkBinaryOperators-16     3831655       288.2 ns/op       0 B/op       0 allocs/op
BenchmarkBinaryOperators-16     4268756       282.7 ns/op       0 B/op       0 allocs/op
BenchmarkBinaryOperators-16     4226371       315.1 ns/op       0 B/op       0 allocs/op
BenchmarkBinaryOperators-16     3671364       317.4 ns/op       0 B/op       0 allocs/op
BenchmarkBinaryOperators-16     4356140       313.5 ns/op       0 B/op       0 allocs/op
PASS
ok   github.com/benhoyt/goawk/interp 7.685s

-Ben


cuong.m...@gmail.com

Dec 22, 2022, 4:43:08 PM
to golang-dev
Hi Ben,

Sorry for not reading your message carefully. If the regression is from 1.19 to 1.20, then https://go-review.googlesource.com/c/go/+/459295 doesn't seem to be the fix.

Best,
Cuong


Keith Randall

Jan 3, 2023, 12:32:53 PM
to golang-dev
This looks to be a difference in inlining, probably due to slightly different IR from the unified frontend.
If I add a //go:noinline directive to value.num in goawk/interp/vm.go, the differences go away (1.19's performance drops to the 1.20 level).


Keith Randall

Jan 3, 2023, 12:55:49 PM
to golang-dev
Looks like this is because (*interp).execute is considered "big". When a function is deemed big, we lower the size threshold we use to decide when to inline things into it.
Before unified IR, the size score for (*interp).execute is 4306. After, it is 5096. The threshold for considering a function big is 5000.
value.num is on the cusp, small enough for normal inlining but too large for into-big inlining.

This also explains why PGO fixes things, as it probably forces the inlining regardless.

Than McIntosh

Jan 3, 2023, 12:59:01 PM
to Keith Randall, golang-dev

I wonder whether it might be worth adding this info to the "-m" output (e.g. whether a given function is considered "big" or not). I might try to send a CL for that.

Than


Keith Randall

Jan 3, 2023, 1:28:33 PM
to golang-dev
The extra nodes appear to be for the implementation of multi-assignment from a function call. For instance, goawk/interp/vm.go:61 has this line:
            l, r := p.peekTwo()

Without unified, we get this IR:

.   .   .   .   CASE # vm.go:60:3
.   .   .   .   CASE-List
.   .   .   .   .   LITERAL-5 compiler.Opcode tc(1) # vm.go:60:16
.   .   .   .   CASE-Body
.   .   .   .   .   AS2FUNC-init
.   .   .   .   .   .   DCL # vm.go:61:4
.   .   .   .   .   .   .   NAME-interp.l Class:PAUTO Offset:0 OnStack interp.value tc(1) # vm.go:61:4
.   .   .   .   .   .   DCL # vm.go:61:7
.   .   .   .   .   .   .   NAME-interp.r Class:PAUTO Offset:0 OnStack interp.value tc(1) # vm.go:61:7
.   .   .   .   .   AS2FUNC Def tc(1) # vm.go:61:9
.   .   .   .   .   AS2FUNC-Lhs
.   .   .   .   .   .   NAME-interp.l Class:PAUTO Offset:0 OnStack interp.value tc(1) # vm.go:61:4
.   .   .   .   .   .   NAME-interp.r Class:PAUTO Offset:0 OnStack interp.value tc(1) # vm.go:61:7
.   .   .   .   .   AS2FUNC-Rhs
.   .   .   .   .   .   CALLFUNC STRUCT-(value, value) tc(1) # vm.go:61:21
.   .   .   .   .   .   .   METHEXPR interp.peekTwo FUNC-func(*interp) (value, value) tc(1) # vm.go:61:13
.   .   .   .   .   .   .   .   TYPE <S> type PTR-*interp tc(1)
.   .   .   .   .   .   CALLFUNC-Args
.   .   .   .   .   .   .   NAME-interp.p Class:PPARAM Offset:0 OnStack PTR-*interp tc(1) # vm.go:30:7

With unified, we get this:

.   .   .   .   CASE # vm.go:60:3
.   .   .   .   CASE-List
.   .   .   .   .   LITERAL-5 compiler.Opcode tc(1) # vm.go:60:16
.   .   .   .   CASE-Body
.   .   .   .   .   AS2-init
.   .   .   .   .   .   DCL # vm.go:61:4
.   .   .   .   .   .   .   NAME-interp.l Class:PAUTO Offset:0 OnStack Used interp.value tc(1) # vm.go:61:4
.   .   .   .   .   .   DCL # vm.go:61:7
.   .   .   .   .   .   .   NAME-interp.r Class:PAUTO Offset:0 OnStack Used interp.value tc(1) # vm.go:61:7
.   .   .   .   .   AS2 Def tc(1) # vm.go:61:9
.   .   .   .   .   AS2-Lhs
.   .   .   .   .   .   NAME-interp.l Class:PAUTO Offset:0 OnStack Used interp.value tc(1) # vm.go:61:4
.   .   .   .   .   .   NAME-interp.r Class:PAUTO Offset:0 OnStack Used interp.value tc(1) # vm.go:61:7
.   .   .   .   .   AS2-Rhs
.   .   .   .   .   .   CONVNOP-init
.   .   .   .   .   .   .   AS2FUNC-init
.   .   .   .   .   .   .   .   DCL # vm.go:61:9
.   .   .   .   .   .   .   .   .   NAME-interp..autotmp_11 esc(N) Class:PAUTO Offset:0 AutoTemp OnStack Used interp.value tc(1) # vm.go:61:9
.   .   .   .   .   .   .   .   DCL # vm.go:61:9
.   .   .   .   .   .   .   .   .   NAME-interp..autotmp_12 esc(N) Class:PAUTO Offset:0 AutoTemp OnStack Used interp.value tc(1) # vm.go:61:9
.   .   .   .   .   .   .   AS2FUNC Def tc(1) # vm.go:61:9
.   .   .   .   .   .   .   AS2FUNC-Lhs
.   .   .   .   .   .   .   .   NAME-interp..autotmp_11 esc(N) Class:PAUTO Offset:0 AutoTemp OnStack Used interp.value tc(1) # vm.go:61:9
.   .   .   .   .   .   .   .   NAME-interp..autotmp_12 esc(N) Class:PAUTO Offset:0 AutoTemp OnStack Used interp.value tc(1) # vm.go:61:9
.   .   .   .   .   .   .   AS2FUNC-Rhs
.   .   .   .   .   .   .   .   CALLFUNC STRUCT-(value, value) tc(1) # vm.go:61:21
.   .   .   .   .   .   .   .   .   METHEXPR interp.peekTwo FUNC-func(*interp) (value, value) tc(1) # vm.go:61:13
.   .   .   .   .   .   .   .   .   .   TYPE <S> type PTR-*interp tc(1)
.   .   .   .   .   .   .   .   CALLFUNC-Args
.   .   .   .   .   .   .   .   .   NAME-interp.p Class:PPARAM Offset:0 OnStack Used PTR-*interp tc(1) # vm.go:30:7
.   .   .   .   .   .   CONVNOP interp.value tc(1) # vm.go:61:7
.   .   .   .   .   .   .   NAME-interp..autotmp_11 esc(N) Class:PAUTO Offset:0 AutoTemp OnStack Used interp.value tc(1) # vm.go:61:9
.   .   .   .   .   .   NAME-interp..autotmp_12 esc(N) Class:PAUTO Offset:0 AutoTemp OnStack Used interp.value tc(1) # vm.go:61:9

There seem to be some extra temporaries and assignments here, probably introduced to handle conversions in case the return values and the assigned variables don't have the same types.

Probably the simplest thing to do for 1.20 is bump the threshold for "big" up a bit. We probably want to avoid any nonbig->big transitions when upgrading from 1.19 to 1.20.
The actual big threshold doesn't matter much, it is just there to guard against blowing code size up too much.