Hi Go Devs,
We have a closed-source program that many customers run without issue, but one particular customer's usage patterns seem to be causing the process to completely lock up and become unresponsive to any input. Sending the process SIGQUIT shows, in the attached log, about 2900 goroutines, of which about 2000 are in the "GC assist wait" state. (I have the full log with stack traces, which I'm happy to share off-list with any Go developers.) The customer says that when the program is in this state, they've waited as long as 15 minutes without the application recovering, at which point they've killed the process.
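For future incidents, we're also considering shipping a diagnostic hook so the customer can capture goroutine dumps on demand without killing the process. Here's a minimal sketch of what we have in mind (the function name and buffer size are placeholders, not our actual code), with the obvious caveat that if the scheduler itself is wedged this goroutine may never get to run:

```go
package main

import (
	"os"
	"os/signal"
	"runtime"
	"syscall"
)

// installStackDumper writes the stacks of all goroutines to stderr
// whenever the process receives SIGUSR1, and leaves the process
// running (unlike SIGQUIT, which exits after dumping).
func installStackDumper() {
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGUSR1)
	go func() {
		buf := make([]byte, 1<<20) // 1 MiB; enlarge if dumps come out truncated
		for range c {
			n := runtime.Stack(buf, true) // true = all goroutines, not just this one
			os.Stderr.Write(buf[:n])
		}
	}()
}

func main() {
	installStackDumper()
	select {} // stand-in for the real application's work
}
```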
They were previously running a build with Go 1.7.4 and observed the same lock-up behavior, so we supplied them a custom build with Go 1.8, expecting that the fixes related to issue 16293 [1] would prevent the stall.
I believe they are running in a virtualized environment with 32 GB of RAM and 40 cores. The process starts without the GOMAXPROCS environment variable set, but at run time our program calls `runtime.GOMAXPROCS(16)`, because this software is licensed per CPU core. My next recommendation to the customer will be to set the GOMAXPROCS environment variable to 16 instead, because I recall reading that parts of the Go runtime assume the initial value of GOMAXPROCS doesn't change.
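If the run-time call turns out to be the problem, the change on their side is just the launch environment rather than a new binary, e.g. (binary name is a placeholder):

```
GOMAXPROCS=16 ./our-app
```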
The customer's network is not accessible to my team, so iterating on this has been very slow, and since we haven't been able to reproduce the hang locally, it's that much more difficult to debug on our end. We don't have any known data races, but we're considering asking them to run a race-detector-enabled build in case their workload exercises something we've missed.
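Concretely, that would just be our normal build with the standard flag:

```
go build -race ./...
```

with the usual caveat that the race detector adds significant CPU and memory overhead, which may itself change the timing enough to hide the hang.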
Does this seem like a runtime bug? What should our next steps be? We can ask the customer to set GODEBUG to whatever would produce the most useful logs. We're currently unable to make any significant code changes to reduce our allocations, due to the perceived risk of further changes.
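Unless someone suggests better settings, we'd probably start with the GC and scheduler traces, e.g. (binary name is again a placeholder):

```
GODEBUG=gctrace=1,schedtrace=1000 ./our-app 2>runtime-trace.log
```

since gctrace should show whether GC cycles are completing at all while the process is wedged, and schedtrace emits scheduler state every second.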
Thanks,
Mark