Locking (`__GI___pthread_mutex_unlock`) takes most of the execution time when using cgo with large number of CPUs

211 views
Skip to first unread message

Bảo Phan Quốc

unread,
Jun 20, 2023, 8:32:25 AM6/20/23
to golang-nuts

I'm using cgo to call a C function from Go. Inside the C function there is a callback to a Go function. In other way, I'm calling Go -> C -> Go.

After running pprof, I noticed that the __GI___pthread_mutex_unlock took half of the execution time. AFAIK, cgo has an overhead, especially calling back from C to Go. But it's weird that cgo takes half of the execution time to do some locking. Is there something wrong in my code?

Here is the gist containing the code https://gist.github.com/phqb/2fa5bc76b77208d7f36b151b405e6dcc.

pprof output:

```

(pprof) top
Showing nodes accounting for 136.20s, 89.12% of 152.82s total
Dropped 252 nodes (cum <= 0.76s)
Showing top 10 nodes out of 64
      flat  flat%   sum%        cum   cum%
    68.20s 44.63% 44.63%     68.20s 44.63%  __GI___pthread_mutex_unlock
    54.32s 35.55% 80.17%     54.32s 35.55%  __lll_lock_wait
     3.57s  2.34% 82.51%     57.89s 37.88%  __pthread_mutex_lock
     2.38s  1.56% 84.07%      3.15s  2.06%  runtime.casgstatus
     1.81s  1.18% 85.25%      1.81s  1.18%  runtime.futex
     1.26s  0.82% 86.08%      3.99s  2.61%  runtime.mallocgc
     1.21s  0.79% 86.87%      2.54s  1.66%  runtime.lock2
     1.21s  0.79% 87.66%      1.48s  0.97%  runtime.reentersyscall
     1.15s  0.75% 88.41%      1.15s  0.75%  runtime.procyield
     1.09s  0.71% 89.12%      1.59s  1.04%  runtime.exitsyscallfast

```

Running environment:

* Golang version: go version go1.20.5 linux/amd64

* `lscpu`:

```

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          32
On-line CPU(s) list:             0-31
Thread(s) per core:              2
Core(s) per socket:              16
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           79
Model name:                      Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping:                        0
CPU MHz:                         2200.152
BogoMIPS:                        4400.30
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       512 KiB
L1i cache:                       512 KiB
L2 cache:                        4 MiB
L3 cache:                        55 MiB
NUMA node0 CPU(s):               0-31
...
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtop
                                 ology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowpre
                                 fetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities

```

Ian Lance Taylor

unread,
Jun 20, 2023, 10:04:43 AM6/20/23
to Bảo Phan Quốc, golang-nuts
On Tue, Jun 20, 2023 at 5:32 AM Bảo Phan Quốc <phq...@gmail.com> wrote:
>
> I'm using cgo to call a C function from Go. Inside the C function there is a callback to a Go function. In other way, I'm calling Go -> C -> Go.
>
> After running pprof, I noticed that the __GI___pthread_mutex_unlock took half of the execution time. AFAIK, cgo has an overhead, especially calling back from C to Go. But it's weird that cgo takes half of the execution time to do some locking. Is there something wrong in my code?

Every call from C to Go does acquire a mutex to check that the Go
runtime has been fully initialized. This is normally not a big deal
as the mutex is held only briefly. But your code does a lot of
parallel calls from C to Go. It is possible that the heavy contention
on the mutex is causing the results that you are seeing.

The code in question is the C function _cgo_wait_runtime_init_done in
runtime/gco/gcc_libinit.c. It might be possible to speed it up in
some way to avoid the mutex in the normal case.

Ian

robert engels

unread,
Jun 20, 2023, 6:36:46 PM6/20/23
to Ian Lance Taylor, Bảo Phan Quốc, golang-nuts
Since this is an initialization guard - can’t it simply be double checked with an atomic flag?
> --
> You received this message because you are subscribed to the Google Groups "golang-nuts" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/CAOyqgcU7Csb_hf5gvy8dVHKEHd8GFkN-fUML18uW0HiQGOAXyg%40mail.gmail.com.

Ian Lance Taylor

unread,
Jun 20, 2023, 6:38:56 PM6/20/23
to robert engels, Bảo Phan Quốc, golang-nuts
On Tue, Jun 20, 2023 at 3:36 PM robert engels <ren...@ix.netcom.com> wrote:
>
> Since this is an initialization guard - can’t it simply be double checked with an atomic flag?

That would be a good first thing to try.

Ian
Reply all
Reply to author
Forward
0 new messages