Go 1.8 (linux/amd64): goroutines stuck in "GC assist wait" for indeterminate length of time

ma...@influxdb.com

Feb 24, 2017, 4:30:00 PM
to golang-dev, aus...@google.com
Hi Go Devs,

We have a closed source program that many customers are running without issue, but one particular customer's usage patterns seem to be causing the process to completely lock up and become unresponsive to any input. Sending the process SIGQUIT shows, in the attached log, about 2900 goroutines, of which about 2000 are in the "GC assist wait" state. (I have the full log with stack traces, which I'm happy to share off-list with any Go developers.) The customer has said that when the program is in this state, they've waited as long as 15 minutes and the application has not recovered, at which point they've killed the process.

They were previously running a build with Go 1.7.4 and observed the same locking-up behavior, and we supplied them a custom build with Go 1.8, expecting that the fixes related to issue 16293 [1] would have prevented the stall.

I believe their hardware is a virtualized environment with 32GB of RAM. The machine has 40 cores and starts up without the GOMAXPROCS environment variable set, but at run time our program calls `runtime.GOMAXPROCS(16)` because this software is licensed per CPU core. My next recommendation to the customer will be to set the GOMAXPROCS environment variable to 16, because I believe I read at one point that the Go runtime assumes the initial value of GOMAXPROCS doesn't change.
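
For concreteness, the call we make is essentially the following; the init wrapper and logging here are just illustration, not our exact code:

package main

import (
    "log"
    "runtime"
)

func init() {
    // Cap the number of Ps running Go code at 16. GOMAXPROCS returns the
    // previous setting, which defaults to the core count (40 on this
    // machine) when the environment variable isn't set.
    prev := runtime.GOMAXPROCS(16)
    log.Printf("GOMAXPROCS lowered from %d to 16", prev)
}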

The customer's network is not accessible to my team, so iterating on debugging this has so far been very slow. We haven't been able to reproduce the hang locally, which makes it that much more difficult to debug on our end. We don't have any known data races, but we're considering asking them to run a race-detector-enabled build in case their workload exercises something we've missed.

Does this seem like a runtime bug? What should be our next steps here? We can ask the customer to set GODEBUG to whatever would produce the most useful logs. We are currently unable to make any significant code changes to reduce our allocations due to the perceived risk of further changes.

Thanks,
Mark

goroutines.txt

r...@golang.org

Feb 24, 2017, 5:49:34 PM
to golang-dev, aus...@google.com, ma...@influxdb.com
>> We can ask the customer to set GODEBUG to whatever would produce the most useful logs. 
If they could run with 

GOTRACEBACK=crash and GODEBUG=gctrace=1

and if they can include the stack trace, that would be great.

ja...@influxdb.com

Mar 10, 2017, 1:35:32 AM
to golang-dev, aus...@google.com, ma...@influxdb.com


On Friday, February 24, 2017 at 3:49:34 PM UTC-7, r...@golang.org wrote:
>> We can ask the customer to set GODEBUG to whatever would produce the most useful logs. 
If they could run with 

GOTRACEBACK=crash and GODEBUG=gctrace=1

and if they can include the stack trace, that would be great.



I'm able to reproduce the stall locally now. I've attached two files from runs with go1.7.4 and go1.8 using GOTRACEBACK=crash and GODEBUG=gctrace=1. Both stalled for several minutes and then recovered. In the go1.8 trace, it logged the following periodically while everything was stalled:

scvg3: 11 MB released
scvg3: inuse: 981, idle: 257, sys: 1239, released: 257, consumed: 981 (MB)
scvg4: 0 MB released
scvg4: inuse: 981, idle: 257, sys: 1239, released: 257, consumed: 981 (MB)
scvg5: 0 MB released
scvg5: inuse: 982, idle: 257, sys: 1239, released: 257, consumed: 982 (MB)
gc 66 @244.746s 0%: 13+718175+0.074 ms clock, 97+0.96/185/229+0.51 ms cpu, 584->822->632 MB, 911 MB goal, 8 P

After it recovered, it stalled again a few minutes later and I killed it to get a trace.
go1.7.4.txt
go1.8.txt

Austin Clements

Mar 10, 2017, 10:27:19 AM
to Jason Wilder, golang-dev, Mark Rushakoff
Thanks! A quick skim of go1.8.txt reveals some interesting things:

* Goroutine 50 running on M 8 did the last bit of marking work and called gcMarkDone. I suspect this happened a while ago and everyone is stuck in assist wait because there's no more work to do. However, goroutine 50 isn't making any progress because it's stuck in forEachP trying to bring all of the goroutines to preemption points.

* Goroutine 45 running on M 9 is, I suspect, what forEachP is stuck on, since it's the only other running goroutine. This appears to be doing a typed slice copy, but the slice is only 16 bytes, so this shouldn't take any time. Unfortunately, the user part of that stack is gone, so all I can say is that it was started by github.com/influxdata/influxdb/tsdb/engine/tsm1.(*compactionStrategy).Apply at /Users/jason/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:1033.

Given goroutine 45, this smells sort of like a non-preemptible loop that's doing lots of slice copies (runtime.typedslicecopy itself is not a preemption point), though that would have to be a huge loop to block it for over 15 minutes. Is it possible there's a loop doing a slice copy that's waiting on some atomic load to change (but never will because whatever it's waiting on is blocked in assist)?
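
To make that concrete, here is a purely hypothetical sketch of the kind of loop I'm asking about (not something I've seen in your code; the names are made up):

package sketch

import "sync/atomic"

// Value stands in for a 16-byte element whose type contains a pointer, so
// copying a slice of them goes through runtime.typedslicecopy.
type Value interface{ UnixNano() int64 }

var ready int32 // hypothetically flipped by some other goroutine

// spinUntilReady busy-waits by repeatedly copying a small slice. The atomic
// load, the typed slice copy, and the loop back-edge are all non-preemptible,
// so if the goroutine that would set ready is itself blocked in "GC assist
// wait", this loop spins forever and the GC can never finish the cycle.
func spinUntilReady(dst, src []Value) {
    for atomic.LoadInt32(&ready) == 0 {
        copy(dst, src)
    }
}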

Is there any possibility of attaching a debugger to the stalled process or to a core dump from the stalled process to see where the non-preemptible goroutine is stuck?

Austin Clements

Mar 10, 2017, 11:04:29 AM
to Jason Wilder, golang-dev, Mark Rushakoff
I found why the runtime didn't print goroutine 45's stack, at least. I've filed a bug (https://golang.org/issue/19494) and sent a fix (https://golang.org/cl/38012), which I've marked for possible inclusion in Go 1.8.1.

ja...@influxdb.com

Mar 10, 2017, 3:52:29 PM
to golang-dev, ja...@influxdb.com, ma...@influxdb.com


On Friday, March 10, 2017 at 9:04:29 AM UTC-7, Austin Clements wrote:
I found why the runtime didn't print goroutine 45's stack, at least. I've filed a bug (https://golang.org/issue/19494) and sent a fix (https://golang.org/cl/38012), which I've marked for possible inclusion in Go 1.8.1.


I re-ran with your fix and attached the trace. It looks like it is stuck in a loop copying a large slice, as you described: https://github.com/influxdata/influxdb/blob/1.1/tsdb/engine/tsm1/encoding.gen.go#L143. This particular slice ended up with several million items during the stall; I've changed our code to avoid that situation, which fixes the stall.

Is it possible there's a loop doing a slice copy that's waiting on some atomic load to change (but never will because whatever it's waiting on is blocked in assist)?

There is a loop doing a large slice copy, but I don't understand what you mean by "waiting on some atomic load to change"?

I've tried recreating this stall using the same code and simulated data running outside of the server, but have not been successful yet. In the environment where I can re-create it, it consistently seems to pause at around 4M items in the slice. My simulated tests have no problem with 40M items or more and much larger heap sizes. Any idea what other factors I might be missing from my simulated test?
go1.8.a3bc3e1.txt

Austin Clements

Mar 13, 2017, 2:40:11 PM
to Jason Wilder, golang-dev, Mark Rushakoff
On Fri, Mar 10, 2017 at 3:52 PM, <ja...@influxdb.com> wrote:


On Friday, March 10, 2017 at 9:04:29 AM UTC-7, Austin Clements wrote:
I found why the runtime didn't print goroutine 45's stack, at least. I've filed a bug (https://golang.org/issue/19494) and sent a fix (https://golang.org/cl/38012), which I've marked for possible inclusion in Go 1.8.1.


I re-ran with your fix and attached the trace. It looks like it is stuck in a loop copying a large slice, as you described: https://github.com/influxdata/influxdb/blob/1.1/tsdb/engine/tsm1/encoding.gen.go#L143. This particular slice ended up with several million items during the stall; I've changed our code to avoid that situation, which fixes the stall.

Perfect. Actually, it looks like that's without the fix, but you got lucky and the SIGQUIT went to the thread that happened to be blocking GC.

Is it possible there's a loop doing a slice copy that's waiting on some atomic load to change (but never will because whatever it's waiting on is blocked in assist)?

There is a loop doing a large slice copy, but I don't understand what you mean by "waiting on some atomic load to change"?

Sorry. Looking back, I had misinterpreted your earlier trace. I thought it was stuck copying a really tiny slice, but what I was reading as the slice size was just the element size. In fact, the earlier trace doesn't show how large the slice is because the stack got cut off. The new trace (go1.8.a3bc3e1.txt) shows you're copying 999 16-byte elements (~16K).

This is still small, obviously. However, now that we know what's calling this, I can say that the copy is, in fact, happening in a potentially non-preemptible loop. The loop you pointed at in encoding.gen.go has a few potential preemption points: the calls to UnixNano, the calls to append, and the call to sort.Search. However, all of the UnixNano implementations are so small that they become auto-nosplit, so they're not actually preemption points. The calls to append are only preemption points if they actually grow the slice. The call to sort.Search is definitely a preemption point, but it's conditional.
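
Schematically, a loop of that shape looks something like this (an illustrative sketch only, not your actual Merge code), with the preemption behavior of each call annotated:

package sketch

import "sort"

// Value stands in for the 16-byte element being merged.
type Value interface{ UnixNano() int64 }

// insertSorted inserts v into the timestamp-sorted slice dst.
func insertSorted(dst []Value, v Value) []Value {
    t := v.UnixNano() // tiny, auto-nosplit: NOT a preemption point

    if n := len(dst); n == 0 || t > dst[n-1].UnixNano() {
        // append is a preemption point only when it actually grows dst;
        // appending within existing capacity never yields.
        return append(dst, v)
    }

    // sort.Search IS a preemption point, but only on iterations that
    // reach this branch.
    i := sort.Search(len(dst), func(j int) bool { return dst[j].UnixNano() >= t })

    // The shift below is a typed slice copy, which is not a preemption
    // point, and it is O(len(dst)) work per inserted value, which is where
    // the O(n^2) smell comes from.
    dst = append(dst, nil)
    copy(dst[i+1:], dst[i:])
    dst[i] = v
    return dst
}

The exact places your code appends, searches, and copies will of course differ; the point is just how each kind of call is classified.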

Non-preemptible loops are a known problem (and have been for some time) and we're hoping to finally fix them in Go 1.9 (https://github.com/golang/go/issues/10958).

That said, while I'm obviously not very familiar with your system, this whole loop smells distinctly O(n^2) to me. If it's possible for it to exhibit O(n^2) behavior and for the stars to align such that it never calls sort.Search or grows a slice with append, then I could definitely see this blocking progress for some time.

I've tried recreating this stall using the same code and simulated data running outside of the server, but have not been successful yet. In the environment where I can re-create it, it consistently seems to pause at around 4M items in the slice. My simulated tests have no problem with 40M items or more and much larger heap sizes. Any idea what other factors I might be missing from my simulated test?

It may be that your simulation is triggering the sort.Search or the slice growth.
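
For example, whether the appends ever yield can hinge entirely on how the destination slice was allocated (again just a sketch, with a placeholder element type):

package sketch

type item struct{ ts, val int64 } // placeholder 16-byte element

func fill(src []item, preallocate bool) []item {
    var dst []item
    if preallocate {
        dst = make([]item, 0, len(src)) // capacity reserved up front
    }
    for _, v := range src {
        // With preallocation, append never grows dst and this loop body has
        // no preemption point at all; without it, append regrows dst
        // periodically, and each growth is a chance to yield to the GC.
        dst = append(dst, v)
    }
    return dst
}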

