CGO: Performance and Batching

Stephen Baynham

Aug 13, 2021, 1:04:07 PM
to golang-dev
CGO performance has come a long way recently.  A benchmark performed on Go 1.15 showed 60ns of overhead for calls into C. I'm getting 30ns in my own dummy benchmark, with an additional 55ns of overhead if I add a callback to return to Go (this includes a ~25% improvement I'm seeing in 1.17rc2 over 1.15).  This is a far cry from where it was some time ago, but it's still an order of magnitude higher than JNI and an order and a half over Rust, and that really does matter for some applications.

The linked benchmark makes it clear that over 80% of this overhead is spent in a lock waiting on the scheduler, which is why it has been proposed a few times that engineers be permitted to call into cgo without scheduler coordination.  The Go maintainers are very reluctant to allow this, for the understandable reason that they want the Go runtime to be guaranteed safe.

So my question is: what if, instead of eliminating scheduler coordination, we allowed it to happen earlier?  The primary advice for cgo performance right now is to batch calls as much as possible, so that the scheduler hit is incurred once for several calls into C.  If engineers could convert a goroutine into a C-safe thread on command, and keep it that way for the callbacks made from Go into C, the easy batching would pretty much get Go over the finish line in terms of parity with other FFIs.
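The batching advice above comes down to simple arithmetic: a fixed per-call cost is divided across however many items one call handles. A minimal sketch of that model follows. The function name `amortizedOverheadNs` and the 55ns figure are illustrative assumptions, not measurements of any particular machine.

```go
package main

import "fmt"

// amortizedOverheadNs returns the share of a fixed per-call overhead that
// each item pays when batchSize items are handled by a single call into C.
// Hypothetical helper for illustration only.
func amortizedOverheadNs(perCallNs float64, batchSize int) float64 {
	if batchSize < 1 {
		batchSize = 1 // a call always handles at least one item
	}
	return perCallNs / float64(batchSize)
}

func main() {
	const perCallNs = 55.0 // assumed fixed overhead per cgo call
	for _, n := range []int{1, 8, 64} {
		fmt.Printf("batch=%d overhead per item=%.2f ns\n",
			n, amortizedOverheadNs(perCallNs, n))
	}
}
```

The model ignores the cost of marshalling a batch into C-visible memory, which in practice limits how much batching helps.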

There are probably some really good reasons why this isn't possible, but I'm curious what they are.  Would Go features be lost if a goroutine were permitted to stop working with the scheduler, and which ones?  Would it just not be the nice green threads we're used to?  Would we not be allowed to start new goroutines or wait on channels?  Is it just too labor-intensive?  Curious what people think.

Ian Lance Taylor

Aug 13, 2021, 1:33:29 PM
to Stephen Baynham, golang-dev
I suppose I don't understand how this could work. A goroutine that is
separate from the scheduler wouldn't be able to do anything that
requires scheduling interaction. That includes allocating memory,
storing a pointer into Go memory (which may require a write barrier),
sending on or receiving from a channel, and so forth. It would be an
extremely limited version of Go, and it would be essentially
impossible for any non-expert to write such code.

As far as I can tell the cgo calling sequence does not acquire the
scheduler lock. I ran perf on a simple cgo call and in my
measurements using perf the hottest line is the atomic.Cas in
runtime.casgstatus. It's not 80% of the time in my measurements, it's
more like 19%. But still. Then 9% of the time seems to be taken by
the `atomic.Store(&pp.status, _Psyscall)` in runtime.reentersyscall.
I don't know why these seem slow, as I would expect these atomic
operations to be uncontended.


Stephen Baynham

Aug 13, 2021, 2:31:20 PM
to golang-dev
Since I posted this I've read a bit more into the proc code and something did catch my eye: there are a few places in the code that call a method that atomic.load's the current g state. Given that the state is an int32 this seems to be adding lock contention for no benefit? To satisfy the race detector? Curious what the throughput on that method is, if it's adding unexpected contention.

Stephen Baynham

Aug 15, 2021, 5:57:25 PM
to golang-dev
A little bit of spare data: the amount of time spent in exitsyscallfast seems to be much higher on my Mac than on my Windows machine. It accounts for ~20% of cgocall time on Mac and ~10% on Windows.  This could just be a hardware difference: my Mac is a laptop and my Windows machine is not.

The exitsyscallfast time is spent on a Cas of p.status in exitsyscallfast, and a naked comparison of p.status in wirep.  Regarding the latter I'm a little out of my depth here, but I now believe atomic operations like atomic.Cas and atomic.Load/Store can block naked reads/writes as well (making my previous post kind of boneheaded)? 

I am going to try to understand the source of contention on these variables a little bit better, if I can.  

Stephen Baynham

Aug 16, 2021, 3:26:53 AM
to golang-dev
I added some instrumentation to the code that accesses g.atomicstatus and ran a benchmark on my laptop, and I didn't see anything I considered unexpected:

cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkAdd-12         205843186               55.21 ns/op
MIN 50 MAX 11440
p50 50 Avg 50.00003258887312
p75 60 Avg 51.946954873970135
p90 60 Avg 53.289127117643694
p99 80 Avg 54.10445756631508
Tail Excess % 3.2540811985772162
CopyStack 0
FromPreempted 9
FromScanStatus 348
GStatus 613709848
ReadStatus 46374
ToPreemptScan 9
ToScanStatus 336
ok 17.331s

I thought maybe some extra wait time was being added by CAS failing and running the inside of the casgstatus loop, but:

GStatus 593394585
GStatusMiss 0

Nope!  However, I am getting multiple casgstatus calls per benchmark iteration.  That's not so surprising, since I'm using an HRTime implementation, etc.  So I decided to remove the HRTime calls and break down casgstatus calls per iteration:

cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkAdd-12         212181692               53.51 ns/op
MIN 1.99 MAX 2.75
p50 2 Avg 2.0000012707875947
p75 2 Avg 2.0000008471917297
p90 2 Avg 2.000000705992958
p99 2 Avg 2.000000641811883
Tail Excess % 1.0002719476092374
CopyStack 0
FromPreempted 9
FromScanStatus 345
GStatus 626386807
GStatusMiss 3
ReadStatus 43752
ToPreemptScan 9
ToScanStatus 335

2 casgstatus calls per iteration seems... fine?  So the only question remaining is: is ~5ns just how long atomic.Cas takes to run?
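For readers unfamiliar with the pattern being counted here, the following is a minimal sketch of a CAS loop on a status word with a miss counter, loosely in the spirit of the GStatus/GStatusMiss instrumentation above. It is not the runtime's code: `casStatus` and the two status constants are hypothetical, and the real runtime.casgstatus does considerably more.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Hypothetical status values; the runtime's real constants differ.
const (
	statusRunning uint32 = iota
	statusSyscall
)

// casStatus retries until it swaps the status word from 'from' to 'to'.
// It returns how many attempts failed before the swap succeeded.
func casStatus(status *uint32, from, to uint32) (misses int) {
	for !atomic.CompareAndSwapUint32(status, from, to) {
		misses++
	}
	return misses
}

func main() {
	status := statusRunning
	// One transition when entering the C call, one when returning to Go:
	// two CAS operations per round trip.
	enter := casStatus(&status, statusRunning, statusSyscall)
	exit := casStatus(&status, statusSyscall, statusRunning)
	fmt.Println("misses:", enter+exit)
}
```

With no other goroutine touching the word, neither loop body runs, which matches the near-zero miss counts reported above.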

Stephen Baynham

Aug 16, 2021, 6:29:36 PM
to golang-dev
My theory right now is that the performance here is vaguely related to:

Here are things I think might be true after some research:
 * Ms are really likely to context switch coming back from a syscall
 * LOCK instruction performance depends a lot on cache coherence: a LOCK instruction is cheap if the same thread on the same core was the last to access the value, but if not it can take over a hundred instructions

If correct, we would expect scheduler atomics to be slow immediately after a cgo call as the cache gets rebuilt.  Making several cgo calls would be slow.  I intend to find a way of measuring whether repeated cgo calls actually create a lot of context switches, though I'm not sure how to do it without Linux.

Stephen Baynham

Aug 18, 2021, 12:00:37 AM
to golang-dev
None of the above panned out, so I made my own little benchmark for LOCK; CMPXCHGL:

BenchmarkCas-12         184787934                6.202 ns/op

Okay, it probably would have been easier to test that first.  This is on my Mac, where that's about 11% of the total cgocall runtime, so if we're averaging two calls to casgstatus per cgo call, then 19% is reasonable.  And XCHGL has an implicit lock in it, or so I've read.  It's getting similar results of around 6ns, which explains the performance of Store/exitsyscall.
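The original benchmark source isn't shown in the thread. A rough way to reproduce this kind of measurement in pure Go is sketched below, assuming that `atomic.CompareAndSwapInt32` is lowered to LOCK CMPXCHGL on amd64. The helper `casFlips` is a hypothetical name, and results will vary by machine.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"testing"
)

// casFlips performs n compare-and-swap operations that each succeed,
// flipping word between 0 and 1, and returns the final value of word.
func casFlips(n int) int32 {
	var word int32
	for i := 0; i < n; i++ {
		old := int32(i & 1) // word is 0 on even iterations, 1 on odd ones
		atomic.CompareAndSwapInt32(&word, old, old^1)
	}
	return word
}

func main() {
	// testing.Benchmark lets the benchmark run outside of `go test`.
	res := testing.Benchmark(func(b *testing.B) {
		casFlips(b.N)
	})
	fmt.Printf("%d iterations, %.3f ns/op\n",
		res.N, float64(res.T.Nanoseconds())/float64(res.N))
}
```

This only measures the uncontended case on a single goroutine, so it gives a floor for the cost of the atomic rather than its cost under contention.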

So there's nothing actually suspicious going on here, just a lot of atomics usage in a path that we would prefer to be faster than atomics can provide.
