Mutex profiling question


Robert Engels

Jul 18, 2024, 11:14:55 AM
to golang-nuts
I found this issue, https://github.com/golang/go/issues/57071 and tried running with 

GODEBUG=profileruntimelocks=1

but it made no difference in the output.

I am using Go 1.22.4 on OSX.

Any ideas how I can determine what is causing this high contention shown below:

(pprof) tree
Showing nodes accounting for 4440.35ms, 100% of 4440.35ms total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context     
----------------------------------------------------------+-------------
 4324.17ms 97.38% 97.38%  4324.17ms 97.38%                | runtime._LostContendedRuntimeLock
----------------------------------------------------------+-------------
                                          116.17ms   100% |   sync.(*Map).Swap (inline)
  116.17ms  2.62%   100%   116.17ms  2.62%                | sync.(*Mutex).Unlock
----------------------------------------------------------+-------------
                                          116.17ms   100% |   github.com/robaho/go-trader/internal/exchange.(*grpcServer).massquote
         0     0%   100%   116.17ms  2.62%                | github.com/robaho/go-trader/internal/exchange.(*exchange).Quote
                                          116.17ms   100% |   github.com/robaho/go-trader/internal/exchange.sendMarketData

robert engels

Jul 18, 2024, 11:36:23 AM
to Robert Engels, golang-nuts
Looks like the name was changed to runtimecontentionstacks
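
For reference, a minimal sketch of how such a profile can be collected; the mutex profile fraction, output file name, and pprof invocation here are illustrative assumptions, not the actual setup of go-trader:

// Run with: GODEBUG=runtimecontentionstacks=1 ./program
// In Go 1.22 this GODEBUG adds call stacks for runtime-internal lock
// contention to the mutex profile; without it they collapse into
// runtime._LostContendedRuntimeLock.
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	// Record every mutex contention event (a value N samples 1/N of them).
	runtime.SetMutexProfileFraction(1)

	// ... run the workload here ...

	// Write the mutex profile; view it with: go tool pprof mutex.pb.gz
	f, err := os.Create("mutex.pb.gz")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := pprof.Lookup("mutex").WriteTo(f, 0); err != nil {
		panic(err)
	}
}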

Much better now…

(pprof) tree
Showing nodes accounting for 481.75ms, 100% of 481.75ms total
Dropped 39 nodes (cum <= 2.41ms)
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context   
----------------------------------------------------------+-------------
                                          401.25ms 83.51% |   runtime.findRunnable (inline)
                                           35.26ms  7.34% |   runtime.sysmon (inline)
                                           17.37ms  3.62% |   runtime.stopm (inline)
                                           14.01ms  2.92% |   runtime.saveBlockEventStack (inline)
                                            6.07ms  1.26% |   runtime.goschedImpl (inline)
                                            1.52ms  0.32% |   runtime.schedule (inline)
                                            1.30ms  0.27% |   sync.(*Mutex).Unlock (inline)
                                            1.30ms  0.27% |   github.com/robaho/go-trader/internal/exchange.sendMarketData (inline)
                                            1.17ms  0.24% |   sync.(*Map).Swap (inline)
                                            1.16ms  0.24% |   github.com/robaho/go-trader/internal/exchange.(*grpcServer).Connection (inline)
  480.46ms 99.73% 99.73%   480.46ms 99.73%                | runtime.unlock

This seems suspect, though: should finding a runnable goroutine really consume so much time? Is this a side effect of the scheduler, and is it really blocked in the unlock call waiting for another routine to be runnable?




robert engels

Jul 18, 2024, 4:00:15 PM
to robert engels, golang-nuts
Anyone have any idea about the high internal lock time in findRunnable?

On Jul 18, 2024, at 10:36 AM, robert engels <ren...@ix.netcom.com> wrote:

Looks like the name was changed to runtimecontentionstacks

Rhys Hiltner

Jul 18, 2024, 7:31:33 PM
to golang-nuts
Hi Robert,

First, note that the contention profile for runtime-internal locks doesn't correctly blame the part of the code that _caused_ delay: with Go 1.22 and 1.23, the call stacks are of (runtime-internal) lock users that _experienced_ delay. That's https://go.dev/issue/66999, and those odd semantics are the reason that those stacks aren't shown by default (and instead require the GODEBUG). That was briefly fixed in late May, but caused some slowdowns and so couldn't be part of the Go 1.23 release.

Contention within findRunnable is probably on sched.lock, which protects the global portion of the scheduler. (Often, scheduling only needs to involve each P's local run queue.)

> Is this a side effect of the scheduler, and is it really blocked in the unlock call waiting for another routine to be runnable?

There really is a thread (an M with a P) that is blocked. It's not waiting for a G to be runnable ... depending on the exact part of findRunnable, it might be waiting to see whether a goroutine is runnable or not, or the M might be trying to go to sleep because it's determined that there's no work. The profile also shows runtime.goschedImpl, which means there's a G calling runtime.Gosched, and its M (and P) are blocked trying to add the G to the global run queue.

As for what's causing it ... consider how much work the program's goroutines do when they run (maybe "only tens of microseconds"), how and how often the program uses Gosched (maybe "a bit too much", though I can't quantify that), the setting for GOMAXPROCS (maybe "lots", though GOOS=darwin implies it won't be huge). Summing the runtime/metrics buckets for "/sched/latencies:seconds" and multiplying by 8 (runtime.gTrackingPeriod) will give an estimate of how many times the scheduler started running a goroutine ... or a runtime/trace can show all of those details at once.
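
For example, a rough sketch of that estimate, reading the metric in-process and hard-coding 8 for the unexported runtime.gTrackingPeriod:

package main

import (
	"fmt"
	"runtime/metrics"
)

// Sum the bucket counts of /sched/latencies:seconds and multiply by 8
// (runtime.gTrackingPeriod) to approximate how many times the scheduler
// started running a goroutine.
func main() {
	samples := []metrics.Sample{{Name: "/sched/latencies:seconds"}}
	metrics.Read(samples)

	hist := samples[0].Value.Float64Histogram()
	var tracked uint64
	for _, c := range hist.Counts {
		tracked += c
	}
	fmt.Printf("approx. goroutine scheduling events: %d\n", tracked*8)
}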

Rhys

Robert Engels

Jul 19, 2024, 10:01:08 AM
to Rhys Hiltner, golang-nuts
Thank you for the very detailed explanation. 

What I am struggling with is that even if the routines only do a little work and go back to sleep (which is probably the case here), as I increase the number of external clients I would expect there to always be a routine ready to run.

What I am observing is that the throughput is capped. E.g. with 10 clients each processes 3k requests a second, but with 20 clients each processes only 1.5k requests a second. Yet the CPU usage is about the same and the IO rates are about the same. Which points to scheduling delays, but I'll confess that after reading your response I am still not sure how to diagnose it (what other tools can I look at? the trace seems too noisy).

Is there some way to get high level scheduling statistics?

robert engels

Jul 19, 2024, 1:39:57 PM
to Rhys Hiltner, golang-nuts
I think I figured this out; the gotraceui tool was essential. The go trace seemed not to work, i.e. all of the profile views below are empty for some reason, but gotraceui correctly showed and reported all of the blocking events.

[screenshots omitted: the profile views in "go tool trace", all empty]

An empty result is reported for all of the types. I think it is broken in Go 1.22.4 on OSX.

Rhys Hiltner

Jul 19, 2024, 5:56:22 PM
to robert engels, golang-nuts
> I think it is broken in Go 1.22.4 on OSX.

Hmm, yes ... when I ask go1.22.4 on darwin/arm64 to create a trace and then use that version of "go tool trace" to view it, I see all four of those as empty too. When I ask go1.23rc2's "go tool trace" to view the data that go1.22.4 generated, they work / are non-empty.

> Which points to scheduling delays

Yes, though sometimes it's hard for tools to understand that the runtime's internal locks can be a non-zero contributor to delay. But I would expect the total scheduling delay to count total wall clock time, even when it includes "unusual" sources of latency such as contention on internal locks.

> the trace seems too noisy

There's a ton of data inside the trace, yes! I don't know your particular app, but I would try collecting an execution trace that overlaps in time with a CPU profile. Then I'd look for "interesting" CPU profile samples within the execution trace data stream (such as ones that show on-CPU time spent within runtime.lock2). I'd take the timestamps of those samples, and see if they're clustered at all. Then I'd look at the whole-process view of the execution trace, and zoom in on whatever time ranges appear (from the CPU profile samples) to be the most interesting.
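
Something like the following sketch can collect the two over the same window; the file names and the fixed 10-second window are assumptions standing in for however the app is actually instrumented:

package main

import (
	"log"
	"os"
	"runtime/pprof"
	"runtime/trace"
	"time"
)

// Collect an execution trace and a CPU profile over the same window so
// CPU samples (e.g. on-CPU time in runtime.lock2) can be lined up
// against the trace timeline.
func main() {
	tf, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer tf.Close()
	pf, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer pf.Close()

	if err := trace.Start(tf); err != nil {
		log.Fatal(err)
	}
	if err := pprof.StartCPUProfile(pf); err != nil {
		log.Fatal(err)
	}

	time.Sleep(10 * time.Second) // ... run the interesting workload here ...

	pprof.StopCPUProfile()
	trace.Stop()
}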

> What I am observing is that the throughput is capped.

Yes, it looks like your app has reached the capacity of sched.lock. Are there ways to use it less, such as fewer calls to runtime.Gosched, having goroutines process work in larger batches, etc?
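
As a sketch of the batching idea (the item type, handler, and batch size are hypothetical, not from go-trader): instead of waking a goroutine per message, each wake-up drains whatever is already queued:

package main

import "sync"

// item and handle are hypothetical stand-ins for the application's own
// message type and processing code.
type item struct{ seq int }

func handle(batch []item) {
	// Process the whole batch in one go.
	_ = batch
}

// worker handles several queued items per wake-up instead of one,
// cutting the number of trips through the scheduler (and sched.lock).
func worker(in <-chan item) {
	for first := range in {
		batch := []item{first}
	drain:
		for len(batch) < 64 { // cap the batch size
			select {
			case it, ok := <-in:
				if !ok {
					break drain
				}
				batch = append(batch, it)
			default:
				break drain
			}
		}
		handle(batch)
	}
}

func main() {
	in := make(chan item, 256)
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		worker(in)
	}()
	for i := 0; i < 1000; i++ {
		in <- item{seq: i}
	}
	close(in)
	wg.Wait()
}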

> Is there some way to get high level scheduling statistics?

That's "/sched/latencies:seconds" from the runtime/metrics package ... both the distribution of the delays (how long does scheduling take), and the sum of the counts (how often is scheduling required).
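
A sketch of reading both out of that metric (the 99th-percentile cut is an arbitrary illustration):

package main

import (
	"fmt"
	"runtime/metrics"
)

// Read /sched/latencies:seconds and report how often scheduling happened
// (sum of the bucket counts) plus a rough 99th-percentile latency, taken
// as the upper edge of the bucket where the cumulative count crosses 99%.
func main() {
	samples := []metrics.Sample{{Name: "/sched/latencies:seconds"}}
	metrics.Read(samples)
	hist := samples[0].Value.Float64Histogram()

	var total uint64
	for _, c := range hist.Counts {
		total += c
	}
	fmt.Printf("scheduling events tracked: %d\n", total)
	if total == 0 {
		return
	}

	target := uint64(float64(total) * 0.99)
	var cum uint64
	for i, c := range hist.Counts {
		cum += c
		if cum >= target {
			// Buckets has len(Counts)+1 edges; Buckets[i+1] is the upper
			// edge of bucket i (it may be +Inf for the last bucket).
			fmt.Printf("approx. p99 scheduling latency: %g seconds\n", hist.Buckets[i+1])
			return
		}
	}
}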