performance regression

robert engels

Sep 29, 2019, 1:37:51 PM
to golan...@googlegroups.com
Hello,

I believe there has been a significant performance regression in later Go releases, specifically 1.13 as compared to 1.9.

This may be limited to OSX. I am running OSX 10.13.6 on a Core i7.

I maintain several projects that are fairly static, and I recently ran some of their performance benchmarks to see how things have improved in later Go releases. Since all tests are run on the same hardware/OS config, it certainly seems to be a Go issue. I ran the tests multiple times under each version, and the numbers reported below are representative.

For the first test, see performance.go, which performs extensive file IO and binary record processing:

Under 1.9.7:

insert time  1000000 records =  2689 ms, usec per op  2
close time  3529 ms
scan time  2319 ms, usec per op  2
scan time 50%  1042 ms, usec per op  2
random access time  13 us per get
close with merge 1 time  2 ms
scan time  2270 ms, usec per op  2
scan time 50%  1055 ms, usec per op  2
random access time  13 us per get

Under 1.13:

insert time  1000000 records =  3150 ms, usec per op  3
close time  5954 ms
scan time  4665 ms, usec per op  4
scan time 50%  2158 ms, usec per op  4
random access time  24 us per get
close with merge 1 time  0 ms
scan time  4690 ms, usec per op  4
scan time 50%  2104 ms, usec per op  4
random access time  23 us per get

So a 100% slowdown in most cases.
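
performance.go itself isn't reproduced in the thread. For context, the random-access phase presumably looks something like the following sketch; the file name, record size, and counts are hypothetical. The point is that each get is a positioned read, i.e. roughly one pread(2)-style syscall, so per-syscall overhead lands directly in the "us per get" figure.

package main

import (
	"fmt"
	"log"
	"math/rand"
	"os"
	"time"
)

func main() {
	const (
		recordSize = 64      // hypothetical fixed record size
		nRecords   = 1000000 // matches the insert count above
		nGets      = 100000  // number of random gets to time
	)
	f, err := os.Open("records.bin") // hypothetical data file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	buf := make([]byte, recordSize)
	start := time.Now()
	for i := 0; i < nGets; i++ {
		// Each ReadAt is a positioned read: roughly one syscall per get.
		off := int64(rand.Intn(nRecords)) * recordSize
		if _, err := f.ReadAt(buf, off); err != nil {
			log.Fatal(err)
		}
	}
	fmt.Println("random access time ", time.Since(start).Microseconds()/nGets, "us per get")
}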

Similarly, review fixed_bench_test.go (this test performs no IO, just very common numeric byte/string transformations). It shows a far smaller slowdown, but I would argue 10% on a micro-benchmark is still significant.
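
fixed_bench_test.go isn't shown either. A stand-in with the same shape, hypothetical except for the benchmark names, could be:

package fixed

import (
	"strconv"
	"testing"
)

// BenchmarkStringFixed formats a float64 with a fixed number of
// decimal places, a common numeric-to-string transformation.
func BenchmarkStringFixed(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = strconv.FormatFloat(12345.6789, 'f', 4, 64)
	}
}

// BenchmarkStringNFixed does the same via AppendFloat into a
// reused buffer, converting the result to a string at the end.
func BenchmarkStringNFixed(b *testing.B) {
	buf := make([]byte, 0, 32)
	for i := 0; i < b.N; i++ {
		_ = string(strconv.AppendFloat(buf[:0], 12345.6789, 'f', 4, 64))
	}
}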

Under 1.9.7:

BenchmarkStringFixed-8       20000000        76.2 ns/op
BenchmarkStringNFixed-8     20000000        74.6 ns/op

Under 1.13:

BenchmarkStringFixed-8       14216155        82.2 ns/op
BenchmarkStringNFixed-8     14680034        80.6 ns/op


peterGo

Sep 29, 2019, 4:16:28 PM
to golang-dev
robert engels,

$ go1.9 version
go version go1.9.7 linux/amd64
$ go1.9 test fixed_bench_test.go -bench=String.?Fixed -benchmem
BenchmarkStringFixed-4        20000000   63.7 ns/op   32 B/op   1 allocs/op
BenchmarkStringNFixed-4       20000000   62.9 ns/op   32 B/op   1 allocs/op

$ go1.13 version
go version go1.13.1 linux/amd64
$ go1.13 test fixed_bench_test.go -bench=String.?Fixed -benchmem
BenchmarkStringFixed-4        18455223   62.9 ns/op   32 B/op   1 allocs/op
BenchmarkStringNFixed-4       18631693   62.2 ns/op   32 B/op   1 allocs/op

$ go version
go version devel +931365763a Sun Sep 29 14:27:05 2019 +0000 linux/amd64
$ go test fixed_bench_test.go -bench=String.?Fixed -benchmem
BenchmarkStringFixed-4        18431575   62.9 ns/op   32 B/op   1 allocs/op
BenchmarkStringNFixed-4       18747723   62.3 ns/op   32 B/op   1 allocs/op


Peter

Robert Engels

Sep 29, 2019, 4:25:52 PM
to peterGo, golang-dev
Thanks. So you are providing evidence that the regression is OS X only.

peterGo

Sep 29, 2019, 4:54:41 PM
to golang-dev
robert engels,


$ go1.9 version
go version go1.9.7 linux/amd64
$ go1.9 build performance.go
$ time ./performance
insert time  1000000 records =  2321 ms, usec per op  2
close time  2284 ms
scan time  993 ms, usec per op  0
scan time 50%  429 ms, usec per op  0
random access time  6 us per get

close with merge 1 time  0 ms
scan time  994 ms, usec per op  0
scan time 50%  428 ms, usec per op  0
random access time  6 us per get
real    0m10.659s
user    0m6.729s
sys        0m1.713s


$ go1.13 version
go version go1.13.1 linux/amd64
$ go1.13 build performance.go
$ time ./performance
insert time  1000000 records =  2178 ms, usec per op  2
close time  2310 ms
scan time  966 ms, usec per op  0
scan time 50%  416 ms, usec per op  0
random access time  6 us per get

close with merge 1 time  0 ms
scan time  969 ms, usec per op  0
scan time 50%  421 ms, usec per op  0
random access time  6 us per get
real    0m10.514s
user    0m6.284s
sys        0m1.734s


$ go version
go version devel +931365763a Sun Sep 29 14:27:05 2019 +0000 linux/amd64
$ go build performance.go
$ time ./performance
insert time  1000000 records =  2275 ms, usec per op  2
close time  2247 ms
scan time  967 ms, usec per op  0
scan time 50%  422 ms, usec per op  0
random access time  6 us per get

close with merge 1 time  0 ms
scan time  963 ms, usec per op  0
scan time 50%  421 ms, usec per op  0
random access time  6 us per get
real    0m10.550s
user    0m6.253s
sys        0m1.779s


Peter


peterGo

Sep 29, 2019, 5:00:29 PM
to golang-dev
robert engels,

To paraphrase, there are lies, damn lies, and benchmarks.

Benchmarks are scientific experiments. Results should be reproducible. I'm unable to reproduce your results.

peter


andrey mirtchovski

Sep 29, 2019, 5:05:09 PM
to peterGo, golang-dev
I think Go switched from direct syscalls to using libSystem after 1.9 on macOS. That will introduce some unavoidable overhead.


andrey mirtchovski

Sep 29, 2019, 5:55:32 PM
to peterGo, golang-dev
I think the relevant issue is https://github.com/golang/go/issues/17490 (I was not able to find it on my phone earlier). It was scheduled for inclusion in 1.10, meaning it postdates 1.9.

Robert Engels

Sep 29, 2019, 5:57:41 PM
to peterGo, golang-dev
Stop being so aggressive, Peter. It is clearly an OS X-related issue, which is why you cannot reproduce it, and it is what I always expected in my OP.

Robert Engels

Sep 29, 2019, 5:58:54 PM
to andrey mirtchovski, peterGo, golang-dev
That would appear to be an unwise choice IMO. Yes, it lowers maintenance costs, but it imposes a pretty large runtime cost.

keith....@gmail.com

Sep 29, 2019, 6:29:45 PM
to golang-dev
We really had to switch to using libc instead of direct syscalls on OSX.
Correctness is 1000x more important than performance.
This issue is particularly problematic because changes between one existing version of OSX and the next might (and did) break every existing Go binary. Not just new ones built for the new OS version, but *all* existing binaries. That's not really a tenable position for Go to be in.

All that said, it would be good to verify that the performance problems you're seeing are really related to this change and not something else.

The overhead of the libc implementation is probably more than it needs to be. People are welcome to work on improvements.


On Sunday, September 29, 2019 at 2:58:54 PM UTC-7, Robert Engels wrote:
That would appear to be an unwise choice IMO. Yes, it lowers maintenance costs, but it imposes a pretty large runtime cost.

On Sep 29, 2019, at 4:04 PM, andrey mirtchovski <mirtc...@gmail.com> wrote:

I think Go switched from direct syscalls to using libSystem after 1.9 on macOS. That will introduce some unavoidable overhead.


Robert Engels

Sep 29, 2019, 6:50:54 PM
to keith....@gmail.com, golang-dev
It’s not a joke if he states “Results should be reproducible. I'm unable to reproduce your results.”  He owes me an apology. 

Robert Engels

Sep 29, 2019, 6:53:03 PM
to keith....@gmail.com, golang-dev
And if you read even two sentences into what you linked to, "particularly the use of statistics to bolster weak arguments", you owe me an apology as well.

Robert Engels

Sep 29, 2019, 6:57:24 PM
to keith....@gmail.com, golang-dev
Thank you for your comments. But doesn’t this mean that every Mac binary would also suffer performance degradation & correctness issues between releases (if they used these APIs)? That is not my experience doing OS X development.


Emmanuel Odeke

Sep 29, 2019, 7:11:53 PM
to Robert Engels, keith....@gmail.com, golang-dev
So for Go 1.12, we updated (*internal/poll.FD).Sync to use the F_FULLFSYNC fcntl on macOS, for the sake of correctness in flushing to permanent storage, per CL https://go-review.googlesource.com/130676 and issue https://github.com/golang/go/issues/26650.
The fsync docs caution that the F_FULLFSYNC fcntl must be used in order to avoid file corruption.

That change drew dramatic reports of degradation. Perhaps it accounts for the ~69% increase in your "close time", by way of the fd.Sync() call.
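
For reference, the same fcntl is reachable outside the standard library; a minimal darwin-only sketch using golang.org/x/sys/unix (not the actual internal/poll code):

package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

// fullFsync forces the drive to flush its write cache to permanent
// storage. On macOS, plain fsync(2) only pushes data to the drive,
// which is why (*FD).Sync switched to F_FULLFSYNC.
func fullFsync(f *os.File) error {
	_, err := unix.FcntlInt(f.Fd(), unix.F_FULLFSYNC, 0)
	return err
}

func main() {
	f, err := os.Create("/tmp/fsync-demo")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if _, err := f.WriteString("hello"); err != nil {
		log.Fatal(err)
	}
	if err := fullFsync(f); err != nil { // darwin-only fcntl
		log.Fatal(err)
	}
}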

Robert Engels

Sep 29, 2019, 7:26:26 PM
to Emmanuel Odeke, keith....@gmail.com, golang-dev
I am more concerned with the random access time degradation. Operations like these seem far more common. They are also read-only, so I’m not sure the Sync change applies here.

Matthew Dempsky

Sep 29, 2019, 9:55:47 PM
to Robert Engels, keith....@gmail.com, golang-dev
On Sun, Sep 29, 2019, 3:57 PM Robert Engels <ren...@ix.netcom.com> wrote:
But doesn’t this mean that every Mac binary would also suffer performance degradation & correctness issues between releases (if they used these APIs)?

Keith's conjecture is that the slow down is from switching from (1) directly issuing system calls using x86 instructions to (2) indirectly issuing system calls using calls to libSystem (which then uses the x86 instructions).

He's not conjecturing that libSystem will continue getting slower over time. Like you say, most binaries use libSystem and do not get slower over time.

But the switch to libSystem was unavoidable. While direct system calls mostly worked, Apple never supported them, and eventually the approach became unmaintainable.

Robert Engels

Sep 29, 2019, 10:11:03 PM
to Matthew Dempsky, keith....@gmail.com, golang-dev
I understand that; it just seems hard to believe that an API layer like libSystem can cause a 2x slowdown for IO calls, which suggests to me that something else is happening. For the random access calls, going from 13 usec to 24 usec is a lot of added overhead. It’s hard to consume even a single usec in a syscall translation layer.

Matthew Dempsky

Sep 29, 2019, 10:42:00 PM
to Robert Engels, keith....@gmail.com, golang-dev
Agreed, that does seem like a lot of overhead to attribute to some extra indirection for system calls, but I'm not familiar with the details.

However, it sounds like you're able to reproduce this performance regression. It would be great if you could try bisecting the Git commit history to narrow down the time frame. Two years of Go development is a long time for us to hypothesize about why things may have slowed down so measurably. Even just measuring each release in between would be useful data to start with.

Cheers

Robert Engels

Sep 29, 2019, 10:55:40 PM
to Matthew Dempsky, keith....@gmail.com, golang-dev
I will start with the per-version test and report back, hopefully later this week. Last time I tried building the Go distribution it didn’t go well, so bisecting via cherry-picks might be tough, but I’ll see what I can do.

robert engels

Sep 30, 2019, 1:34:33 AM
to andrey mirtchovski, peterGo, golan...@googlegroups.com
I can confirm that the major regression does not occur in 1.11.13 but does occur in 1.12.10.

Still, I don’t think https://github.com/golang/go/issues/30497 tells the whole story, since, as I pointed out, the performance is 2x worse for the random-read case, which does not use fd.Sync(). It is difficult to imagine how a syscall translation layer for IO calls can cause this sort of degradation. I am going to create a bench test to isolate this so I can check the pprof output and allocations.

I am also going to look into the minor 10% regression on the CPU bound benchmarks later this week.



minux

Sep 30, 2019, 2:01:50 AM
to golan...@googlegroups.com
On Mon, Sep 30, 2019 at 1:34 AM robert engels <ren...@ix.netcom.com> wrote:
> I can confirm that the major regression does not occur in 1.11.13 but does occur in 1.12.10.

it almost certainly is the libSystem syscall change.
Changing to use libSystem.dylib is not as simple as just turning a
syscall into a call to the respective function in libSystem.dylib:
Go needs to switch to the system stack and use the cgo mechanism to
call foreign C functions, whereas in the pre-1.12 direct-syscall case,
a Go program can issue the syscall directly, without the cgo overhead.

so the tl;dr version is: the performance degradation you observed
comes from using the cgo mechanism to call into libSystem.dylib.

Robert Engels

Sep 30, 2019, 7:54:42 AM
to minux, golan...@googlegroups.com
Thanks for the input, but something doesn’t seem right. I need to review the code, but I’m pretty sure there are only one or two IO operations in this code path, so that would mean cgo is adding ~7 usec per call. That seems too high. Those times look more like a context switch, so maybe the fast-path non-multiplexing IO was lost?

Bryan C. Mills

Sep 30, 2019, 10:24:11 AM
to Robert Engels, minux, golan...@googlegroups.com
A cgo call is very much like a context switch. The runtime switches execution to a different stack and writes some scheduling metadata, which may introduce cache or lock contention on the scheduler. (See https://golang.org/issue/19574 and related issues.)

minux

Sep 30, 2019, 10:30:43 AM
to golan...@googlegroups.com
TL;DR I found a probably unintended change in the libSystem.dylib CL, so I've sent https://golang.org/cl/197938
to partially restore the syscall performance to Go 1.11 level.

See below for a more detailed explanation of this problem.


On Mon, Sep 30, 2019 at 7:54 AM Robert Engels <ren...@ix.netcom.com> wrote:
>
> Thanks for the input, but something doesn’t seem right. I need to review the code, but I’m pretty sure there are only one or two IO operations in this code path, so that would mean cgo is adding ~7 usec per call. That seems too high. Those times look more like a context switch, so maybe the fast-path non-multiplexing IO was lost?

To demonstrate the syscall performance degradation due to the libSystem.dylib change, I wrote this:
https://play.golang.org/p/e7PiTSttMOL
It uses three different mechanisms to call the read(2) syscall with arguments that obviously
yield EBADF, so the syscall itself should be fast (and non-blocking).
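
For readers who don't want to open the playground link, the two non-cgo mechanisms look roughly like the following ordinary benchmark file (a sketch; the cgo variant is omitted):

package syscallbench

import (
	"syscall"
	"testing"
	"unsafe"
)

var buf [1]byte

// BenchmarkSyscall goes through syscall.Read, which on darwin in
// Go 1.12+ routes through libSystem.dylib.
func BenchmarkSyscall(b *testing.B) {
	for i := 0; i < b.N; i++ {
		syscall.Read(-1, buf[:]) // EBADF: fails fast, never blocks
	}
}

// BenchmarkSyscallOld uses syscall.Syscall(SYS_READ, ...) directly;
// in Go 1.11 this issues the raw SYSCALL instruction.
func BenchmarkSyscallOld(b *testing.B) {
	for i := 0; i < b.N; i++ {
		syscall.Syscall(syscall.SYS_READ, ^uintptr(0),
			uintptr(unsafe.Pointer(&buf[0])), 1)
	}
}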

On darwin/amd64,
$ go1.11.13 build tmp.go
$ ./tmp -test.bench=. -test.benchmem
goos: darwin
goarch: amd64
BenchmarkCgo-12           2000000       626 ns/op       0 B/op       0 allocs/op
BenchmarkSyscall-12       3000000       615 ns/op       8 B/op       1 allocs/op
BenchmarkSyscallOld-12     3000000       517 ns/op       0 B/op       0 allocs/op
PASS

$ go1.12.10 build tmp.go
$ ./tmp -test.bench=. -test.benchmem
goos: darwin
goarch: amd64
BenchmarkCgo-12           2000000       655 ns/op       0 B/op       0 allocs/op
BenchmarkSyscall-12       1000000      1347 ns/op       8 B/op       1 allocs/op
BenchmarkSyscallOld-12     3000000       538 ns/op       0 B/op       0 allocs/op
PASS

You can see that the Read syscall slows down by roughly 2x:
benchmark               old ns/op     new ns/op     delta
BenchmarkCgo-12            626           655           +4.63%
BenchmarkSyscall-12        615           1347          +119.02%
BenchmarkSyscallOld-12     517           538           +4.06%

If your benchmark is syscall heavy, it won't be surprising that it gets significantly slower with Go 1.12 on darwin.

That being said, why is the Go 1.12 syscall 2x slower than an equivalent cgo call? The cgo performance is unchanged.
It doesn't seem related to memory allocation (both versions allocate 8 bytes per syscall).

We need to dig deeper.

In Go 1.11, syscall.Read basically does what BenchmarkSyscallOld does -- syscall.Syscall(SYS_READ, ...): it
calls runtime.entersyscall and then invokes the syscall directly via the SYSCALL instruction.

In Go 1.12, syscall.Read instead calls syscall.syscall, which is defined in the runtime (https://github.com/golang/go/blob/master/src/runtime/sys_darwin.go#L59)
and just uses libcCall (https://github.com/golang/go/blob/master/src/runtime/sys_darwin.go#L14) to call the libSystem function.
libcCall eventually calls asmcgocall, which handles the ABI and stack switching needed to call foreign functions from Go.

Careful study of this call chain revealed one crucial difference between syscall.Syscall(syscall.SYS_READ, ...) and syscall.Read:
the former uses runtime.entersyscall(), whereas the latter calls runtime.entersyscallblock().

To see if that's the underlying reason for the 2x slowdown, let's modify syscall.Syscall to call runtime.entersyscallblock
as well. And indeed, BenchmarkSyscallOld then becomes 2x slower too. I can reproduce the same slowdown on a similarly modified
Go 1.11 as well.

Therefore, the conclusion is: the slowdown for darwin syscalls in Go 1.12 is due to the libSystem change, but the main cost
is not the call into libSystem itself; rather, it is the change from runtime.entersyscall() to runtime.entersyscallblock().
(The speed difference seems to be that the latter hands off the P immediately, instead of waiting for sysmon to do it as the former does.)

To summarize: on Darwin, Go 1.12 changed most syscalls to proactively hand off the P, whereas Go 1.11 and earlier
optimistically assumed the syscall would not block for long, with long blocks handled by sysmon. To me, this seems like an
unintended behavior change, especially given that it slows down fast syscalls by 100%. I've sent CL 197938 for this issue.
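
A quick way to sanity-check fast-syscall overhead on any Go version, independent of the test program above: getpid(2) does almost no kernel work, so ns/op is mostly the Go-side syscall entry/exit cost. A sketch:

package main

import (
	"fmt"
	"os"
	"testing"
)

func main() {
	// os.Getpid issues the syscall on every call, so this measures
	// the per-syscall entry/exit cost rather than kernel work.
	r := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			os.Getpid()
		}
	})
	fmt.Println("ns per getpid:", r.NsPerOp())
}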

Robert Engels

Sep 30, 2019, 3:30:21 PM
to minux, golan...@googlegroups.com

That is in line with what I suggested: the "fast path" was no longer being taken. Thanks for your detailed investigation.

Robert Engels

Oct 1, 2019, 9:02:54 AM
to minux, golan...@googlegroups.com
Doesn’t Go reserve enough space to make the call using the goroutine stack? If not, that seems like something it should be doing; otherwise cgo in tight loops would be continually switching stacks.


Bryan C. Mills

Oct 1, 2019, 9:12:00 AM
to Robert Engels, minux, golan...@googlegroups.com
Goroutine stacks are in general much smaller than C stacks. Go functions have run-time stack checks and can reallocate (and relocate) the stack, whereas C functions generally rely on guard pages and immovable stacks.

I don’t remember what happens in the case of cgo calls in tight loops.

minux

Oct 1, 2019, 10:01:12 AM
to golan...@googlegroups.com
On Tue, Oct 1, 2019 at 9:02 AM Robert Engels <ren...@ix.netcom.com> wrote:
Doesn’t Go reserve enough space to make the call using the goroutine stack? If not, that seems like something it should be doing; otherwise cgo in tight loops would be continually switching stacks.

C functions generally assume a growable stack (grown implicitly, on memory access), which is incompatible with Go's
stacks (generally smaller than C stacks, with growth managed by the Go runtime via explicit function frame-size annotations).

Also, the actual stack switch is pretty fast (swapping a few registers with memory); the majority of the cgo overhead comes
from scheduler interaction.
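
That fixed cost is easy to see with a no-op cgo call; a minimal sketch, assuming cgo is enabled:

package main

/*
static int identity(int x) { return x; }
*/
import "C"

import (
	"fmt"
	"testing"
)

func main() {
	// The C function does nothing, so ns/op is almost entirely the
	// cgo call overhead: stack switch plus scheduler bookkeeping.
	r := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			C.identity(C.int(i))
		}
	})
	fmt.Println("ns per cgo call:", r.NsPerOp())
}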

Robert Engels

Oct 1, 2019, 12:52:52 PM
to minux, golan...@googlegroups.com
Why would the cgo call in this case need to be handled any differently from the original syscall (in terms of context switching, not the stack)?


minux

Oct 1, 2019, 1:35:10 PM
to golan...@googlegroups.com
On Tue, Oct 1, 2019 at 12:52 PM Robert Engels <ren...@ix.netcom.com> wrote:
Why would the cgo call in this case need to be handled any differently from the original syscall (in terms of context switching, not the stack)?

They are handled almost exactly the same. In terms of things with major overhead there are only minor
bookkeeping differences: cgo calls go via runtime.cgocall, and syscalls have slightly less bookkeeping than
cgocall, because a cgo call needs to be prepared for a cgo callback into Go, which won't happen with syscalls.

Emmanuel Odeke

Oct 5, 2019, 8:31:00 PM
to minux, golan...@googlegroups.com
Great patch, Minux! That patch also fixed an issue with the runtime and internal/poll consuming threads when dealing with non-blocking I/O.

Given the program in https://github.com/golang/go/issues/32326#issuecomment-526986444: before the libSystem CL https://go-review.googlesource.com/c/go/+/141639, on a GOMAXPROCS=8 machine, 8 threads were able to handle 100 os.Pipe reads, and after
that CL they could not. With the change back to entersyscall() instead of entersyscallblock(), we are now back to the normal behavior.
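
A sketch in the spirit of that reproducer (not the exact program from the issue): park many goroutines in pipe reads and count the OS threads created.

package main

import (
	"fmt"
	"os"
	"runtime/pprof"
	"time"
)

func main() {
	var writers []*os.File
	for i := 0; i < 100; i++ {
		r, w, err := os.Pipe()
		if err != nil {
			panic(err)
		}
		writers = append(writers, w)
		go func(r *os.File) {
			var b [1]byte
			r.Read(b[:]) // blocks until the matching write below
		}(r)
	}
	time.Sleep(time.Second) // let the readers park

	// With entersyscallblock, each blocked read pinned an OS thread;
	// with the poller path restored, the count stays near GOMAXPROCS.
	fmt.Println("threads created:", pprof.Lookup("threadcreate").Count())

	for _, w := range writers {
		w.Write([]byte{0})
	}
}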

I had set aside today to examine what changed and went wrong, but I just went through the recent patches and discovered that your CL fixed both issues.
Thank you, and the reviewers as well!


robert engels

Nov 6, 2019, 8:17:07 PM
to Emmanuel Odeke, minux, golan...@googlegroups.com
I can confirm that this patch restored the performance of ‘keydb’ on OSX to that of the 1.9 release.