Goroutines stuck in runtime_pollWait


Siddhesh Divekar

Aug 22, 2020, 3:06:18 PM
to golang-nuts
Hi All,

We saw an issue with our process running in k8s on Ubuntu 18.04.
Most of the goroutines are stuck for several minutes in the net/http and http2 code.

Have you seen similar issues?

goroutine 2800143 [select, 324 minutes]:
net/http.(*persistConn).readLoop(0xc00187d440)
        /usr/local/go/src/net/http/transport.go:2032 +0x999
created by net/http.(*Transport).dialConn
        /usr/local/go/src/net/http/transport.go:1580 +0xb0d

goroutine 2738894 [IO wait, 352 minutes]:
internal/poll.runtime_pollWait(0x7f5b61b280c0, 0x72, 0xffffffffffffffff)
        /usr/local/go/src/runtime/netpoll.go:184 +0x55
internal/poll.(*pollDesc).wait(0xc0017e7e18, 0x72, 0x1000, 0x1000, 0xffffffffffffffff)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc0017e7e00, 0xc0044a9000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/local/go/src/internal/poll/fd_unix.go:169 +0x1cf
net.(*netFD).Read(0xc0017e7e00, 0xc0044a9000, 0x1000, 0x1000, 0xc0026359e8, 0x49d7fd, 0xc0017e7e00)
        /usr/local/go/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc0000db8b8, 0xc0044a9000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/net.go:184 +0x68
net/http.(*connReader).Read(0xc004a4fec0, 0xc0044a9000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:785 +0xf4
bufio.(*Reader).fill(0xc003f1a360)
        /usr/local/go/src/bufio/bufio.go:100 +0x103
bufio.(*Reader).Peek(0xc003f1a360, 0x4, 0x0, 0x0, 0x0, 0x0, 0xc002635ad0)
        /usr/local/go/src/bufio/bufio.go:138 +0x4f
net/http.(*conn).readRequest(0xc0028e1d60, 0x393ed20, 0xc0024e9780, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:962 +0xb3b
net/http.(*conn).serve(0xc0028e1d60, 0x393ed20, 0xc0024e9780)
        /usr/local/go/src/net/http/server.go:1817 +0x6d4
created by net/http.(*Server).Serve
        /usr/local/go/src/net/http/server.go:2928 +0x384

Is there a known issue, or anything obvious from the backtrace here?

Ian Lance Taylor

Aug 22, 2020, 3:31:06 PM
to Siddhesh Divekar, golang-nuts
It's entirely normal for goroutines to sit in pollWait if they are
waiting for network I/O. There may be reasons why this is incorrect
for your program, but you'll have to tell us those reasons.

Also, along with those reasons, please tell us the version of Go and
the exact environment that you are running. Thanks.

Ian

Siddhesh Divekar

Aug 24, 2020, 10:09:56 AM
to Ian Lance Taylor, golang-nuts
Hi Ian,

Thanks for replying. 

We have a Go server which handles user requests & collects data from various sources such as GCP's Cloud SQL and BigQuery.
We are using Shopify's sarama library for Kafka operations.

We are seeing lots of goroutines in the waiting state for several minutes.
Over time, around 587 goroutines have been spun up.

We see that two goroutines are stuck on GCP BigQuery, and we are using wait groups there.
However, it's not clear why that would cause all the other goroutines to hang & drive the CPU high.

goroutine 3332131 [semacquire, 79 minutes]:
sync.runtime_Semacquire(0xc001c4fcf8)
        /usr/local/go/src/runtime/sema.go:56 +0x42
sync.(*WaitGroup).Wait(0xc001c4fcf0)
        /usr/local/go/src/sync/waitgroup.go:130 +0x64
git.fusion.io/fusionio/fusion/controller.git/stats.(*InsMgr).runParallelQuery(0xc001b54d40, 0xc002912c00, 0x330e1b0, 0xf, 0xc002912cf0, 0x3)
        /builds/fusionio/fusion/controller/stats/ins_mgr.go:488 +0x1d7
git.fusion.io/fusionio/fusion/controller.git/stats.(*InsMgr).GetMainUi(0xc001b54d40, 0xc002912db8, 0xc001870e68, 0x746121, 0xc0010fcaf8, 0x17)
        /builds/fusionio/fusion/controller/stats/ins_mgr.go:567 +0xa0d
git.fusion.io/fusionio/fusion/controller.git/stats.(*Prefetcher).fetchMainUiTeamInterval(0xc001b56780, 0xc002356810, 0x24, 0x32f7b78, 0x5)
        /builds/fusionio/fusion/controller/stats/prefetcher.go:77 +0xf2
created by git.fusion.io/fusionio/fusion/controller.git/stats.(*Prefetcher).prefetchStats
        /builds/fusionio/fusion/controller/stats/prefetcher.go:100 +0xd8


goroutine 3332149 [semacquire, 79 minutes]:
sync.runtime_Semacquire(0xc0015ede48)
        /usr/local/go/src/runtime/sema.go:56 +0x42
sync.(*WaitGroup).Wait(0xc0015ede40)
        /usr/local/go/src/sync/waitgroup.go:130 +0x64
git.fusion.io/fusionio/fusion/controller.git/stats.(*InsMgr).runParallelQuery(0xc001b54d40, 0xc00249dc00, 0x330e1b0, 0xf, 0xc00249dcf0, 0x3)
        /builds/fusionio/fusion/controller/stats/ins_mgr.go:488 +0x1d7
git.fusion.io/fusionio/fusion/controller.git/stats.(*InsMgr).GetMainUi(0xc001b54d40, 0xc00249ddb8, 0xc003200668, 0xc00407a520, 0xc003200590, 0x46ee97)
        /builds/fusionio/fusion/controller/stats/ins_mgr.go:567 +0xa0d
git.fusion.io/fusionio/fusion/controller.git/stats.(*Prefetcher).fetchMainUiTeamInterval(0xc001b56780, 0xc002356ba0, 0x24, 0x32f7b78, 0x5)
        /builds/fusionio/fusion/controller/stats/prefetcher.go:77 +0xf2
created by git.fusion.io/fusionio/fusion/controller.git/stats.(*Prefetcher).prefetchStats
        /builds/fusionio/fusion/controller/stats/prefetcher.go:100 +0xd8

I found the link below, which roughly correlates with our scenario.

Most of the goroutines in the backtrace are in the net/http package, so our suspicion is that the above bug in our code might be causing that.
Even the BigQuery calls are getting hung in net/http.

We are using Go 1.13.8 & are running on a GCP Kubernetes cluster, on Ubuntu 18.04 Docker.

go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/root/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/builds/prosimoio/prosimo/pdash/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build048009048=/tmp/go-build -gno-record-gcc-switches"

Let me know if any other information is needed.

--
-Siddhesh.
backtrace.txt

Robert Engels

Aug 24, 2020, 3:54:51 PM
to Siddhesh Divekar, Ian Lance Taylor, golang-nuts
Goroutines in a waiting state should not be consuming CPU. Are you certain they are not in constant transition from waiting to processing? That could show up as high CPU usage while everything looks blocked.

I would use pprof - github.com/robaho/goanalyzer might be of assistance here to see the actual work being done. 
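
For example, something along these lines exposes the standard pprof endpoints so you can grab a goroutine dump or CPU profile the next time it wedges (just a sketch; the blank import is what registers the handlers, and the port is arbitrary):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve pprof on a separate, internal-only port so profiles can still be
	// collected even if the main listener is wedged.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... rest of the application ...
	select {}
}

With that in place you can fetch /debug/pprof/goroutine?debug=2 for a full goroutine dump (no SIGABRT needed) or /debug/pprof/profile for a CPU profile, and feed it to go tool pprof or goanalyzer.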

On Aug 24, 2020, at 9:10 AM, Siddhesh Divekar <siddhesh...@gmail.com> wrote:



Siddhesh Divekar

Aug 24, 2020, 7:12:05 PM
to Robert Engels, Ian Lance Taylor, golang-nuts
Looking at the number of goroutines we have, does this apply to our issue?
--
-Siddhesh.

Siddhesh Divekar

Aug 24, 2020, 7:16:12 PM
to Robert Engels, Ian Lance Taylor, golang-nuts
Hi Robert,

Sorry I missed your earlier response.

From what we saw, our UI was blocked, and since everything was unresponsive
we had to recover the system by sending SIGABRT.
--
-Siddhesh.

Robert Engels

Aug 24, 2020, 8:38:16 PM
to Siddhesh Divekar, Ian Lance Taylor, golang-nuts
I think it is too hard to tell with the limited information. It could be exhausted connections, or it could be thrashing (given the claim of high CPU).

I think you want to run a profiling capture prior to hitting the stuck state - you should be able to detect what is happening.

If the referenced issue is related I would assume you should be able to connect by forcing http/1. 

You can also try disabling http/2 and see if your issue goes away. 
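
For outbound requests that would look roughly like this (a sketch; the helper name is made up, a non-nil empty TLSNextProto map is what keeps net/http from upgrading to HTTP/2, and the GODEBUG variable avoids a code change altogether):

package main

import (
	"crypto/tls"
	"net/http"
)

// newHTTP1Client returns a client that will not negotiate HTTP/2:
// a non-nil, empty TLSNextProto map disables the automatic upgrade.
func newHTTP1Client() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			TLSNextProto: map[string]func(string, *tls.Conn) http.RoundTripper{},
		},
	}
}

func main() {
	_ = newHTTP1Client()
	// Alternatively, without rebuilding, run the process with
	//   GODEBUG=http2client=0,http2server=0
	// to disable HTTP/2 in the default client transport and in net/http servers.
}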

Siddhesh Divekar

Aug 24, 2020, 11:30:59 PM
to Robert Engels, Ian Lance Taylor, golang-nuts
Robert,

We will do the profiling the next time we hit the issue & see what is happening.
This was the first time we saw the issue & we don't want to give up the advantages of http2 without making sure it's the actual culprit.

Do you think in the meantime we should do what the discussion below suggests anyway?
--
-Siddhesh.

Robert Engels

Aug 25, 2020, 6:26:26 AM
to Siddhesh Divekar, Ian Lance Taylor, golang-nuts
The TCP protocol allows a connection to wait for hours. Goroutines stuck in wait do not burn CPU. Are the clients local or remote (over the internet)?

On Aug 24, 2020, at 10:29 PM, Siddhesh Divekar <siddhesh...@gmail.com> wrote:



Siddhesh Divekar

Aug 25, 2020, 10:24:20 AM
to Robert Engels, Ian Lance Taylor, golang-nuts
Clients are over the internet.
--
-Siddhesh.

Robert Engels

Aug 25, 2020, 2:26:21 PM
to Siddhesh Divekar, Ian Lance Taylor, golang-nuts
Are you transferring a lot of data? Are the servers non-cloud hosted? 

You could be encountering “tcp stalls”. 

On Aug 25, 2020, at 9:24 AM, Siddhesh Divekar <siddhesh...@gmail.com> wrote:



Siddhesh Divekar

Aug 25, 2020, 4:15:50 PM
to Robert Engels, Ian Lance Taylor, golang-nuts
Both servers and data sources are in the cloud.
I would not say a lot of data; it's precomputed data which shouldn't take that long to transfer.

--
-Siddhesh.

Siddhesh Divekar

Aug 26, 2020, 11:09:59 AM
to Robert Engels, Ian Lance Taylor, golang-nuts
Robert,

I assume we can safely add these timeouts, based on what we expect should be a reasonable timeout for our clients?

s.ReadTimeout = expTimeOut * time.Second
s.WriteTimeout = expTimeOut * time.Second
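
For context, the full server setup would look something like this (a sketch; the address, the 30-second value and the extra IdleTimeout are placeholders, not what we actually run):

package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	const expTimeOut = 30 // placeholder; pick a ceiling that fits your slowest legitimate client

	s := &http.Server{
		Addr:         ":8443",                  // placeholder address
		ReadTimeout:  expTimeOut * time.Second, // limits reading the request (headers + body)
		WriteTimeout: expTimeOut * time.Second, // limits writing the response
		IdleTimeout:  120 * time.Second,        // reaps idle keep-alive connections
	}
	log.Fatal(s.ListenAndServe())
}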
--
-Siddhesh.

robert engels

Aug 26, 2020, 8:03:10 PM
to Siddhesh Divekar, Ian Lance Taylor, golang-nuts
That should allow your server to clean up “dead” clients. Typically you use this in conjunction with a ‘keep alive’ in the protocol.

I am doubtful that a bunch of dead clients hanging around would cause a CPU spike. You really don’t have too many Go routines/connections involved (I’ve worked with 1000’s of live connections).

I would look at something else… I am guessing your synchronization is incorrect and your threads are blocking, and you have a few that are spinning expecting something to happen that never will (because the source of the event is blocked on the mutex/lock).



Siddhesh Divekar

Aug 26, 2020, 9:27:44 PM
to robert engels, Ian Lance Taylor, golang-nuts
Robert,

That's where the confusion is.
From the backtrace we see that two goroutines are blocked on the waitgroup.
The goroutine on which they are blocked is waiting on BigQuery to return.

On every user request we create a new goroutine, so those shouldn't get blocked because of these two.

However, we do see one goroutine blocked on a futex. Is there a way to know where it originated?
When does Go use futex under the hood, so that we can investigate from that angle?

goroutine 0 [idle]:
runtime.futex(0x5b607a8, 0x80, 0x0, 0x0, 0x0, 0x7ffc5dd1bd20, 0x437640, 0xc003638038, 0x7ffc5dd1bd48, 0x40c78f, ...)
        /usr/local/go/src/runtime/sys_linux_amd64.s:535 +0x21
runtime.futexsleep(0x5b607a8, 0x0, 0xffffffffffffffff)
        /usr/local/go/src/runtime/os_linux.go:44 +0x46
runtime.notesleep(0x5b607a8)
        /usr/local/go/src/runtime/lock_futex.go:151 +0x9f
runtime.stopm()
        /usr/local/go/src/runtime/proc.go:1928 +0xc0
runtime.exitsyscall0(0xc003638000)
        /usr/local/go/src/runtime/proc.go:3140 +0x111
runtime.mcall(0x0)
        /usr/local/go/src/runtime/asm_amd64.s:318 +0x5b
--
-Siddhesh.

Robert Engels

Aug 26, 2020, 10:42:48 PM
to Siddhesh Divekar, Ian Lance Taylor, golang-nuts
The big query client may be “single threaded” meaning you may need a pool of connections. Not sure - haven’t used the BQ library from Go. 

If that’s the case a single slow query will block all of your connections (if they are all accessing BQ). Plus there are BQ rate limits - when you hit those the connector may block - again blocking all clients. 
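
If that turns out to be it, putting a hard deadline on each query would at least stop one stuck call from pinning a collector forever. A rough sketch against the cloud.google.com/go/bigquery API (the project ID, SQL and 90-second deadline are placeholders):

package main

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/bigquery"
)

// runBoundedQuery is an illustrative helper: it gives a single BigQuery
// query a hard deadline so a stuck call cannot block its caller forever.
func runBoundedQuery(ctx context.Context, client *bigquery.Client, sql string) error {
	ctx, cancel := context.WithTimeout(ctx, 90*time.Second) // placeholder deadline
	defer cancel()

	it, err := client.Query(sql).Read(ctx)
	if err != nil {
		return err
	}
	// Drain rows here; omitted for brevity.
	_ = it
	return nil
}

func main() {
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	if err := runBoundedQuery(ctx, client, "SELECT 1"); err != nil {
		log.Println("query failed or timed out:", err)
	}
}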

Robert Engels

Aug 26, 2020, 10:45:47 PM
to Siddhesh Divekar, Ian Lance Taylor, golang-nuts
If you look at the stack trace, the futex is there because the runtime is trying to shut down the entire process - killing all of the M's - probably because you sent the SIGABRT.

On Aug 26, 2020, at 9:41 PM, Robert Engels <ren...@ix.netcom.com> wrote:



Siddhesh Divekar

Aug 26, 2020, 11:51:33 PM
to Robert Engels, Ian Lance Taylor, golang-nuts
Right, then it looks less likely that we are blocked on a mutex.

Every 2 minutes we spin up a goroutine which in turn spins up a bunch of goroutines to collect data from BigQuery & Elastic (the data collector routines).
The 2-minute goroutine then waits on a waitgroup for the data collector routines to come back. So it's not a single goroutine from our side, at least.
From the backtrace we have from the SIGABRT, we see only one such data collector goroutine blocked, and the other two 2-minute goroutines waiting on their waitgroups.

I see similar issues on the internet. Kubernetes hit the following issue, which has exactly the same signature as ours.


I will follow up with the BigQuery folks to see what their behavior is. If one of the requests gets stuck on BigQuery, what do they do with the other requests?
But in that case, in our backtrace we should see all the other requests waiting in their code, which is not the case.


--
-Siddhesh.

Siddhesh Divekar

Aug 27, 2020, 12:13:27 AM
to Robert Engels, Ian Lance Taylor, golang-nuts
Typo.

From the backtrace we have from the SIGABRT, we see only 2 such data collector goroutines blocked, and the other two 2-minute goroutines waiting on their waitgroups.
 
--
-Siddhesh.

Kurtis Rader

Aug 27, 2020, 12:29:38 AM
to Siddhesh Divekar, Robert Engels, Ian Lance Taylor, golang-nuts
On Wed, Aug 26, 2020 at 8:51 PM Siddhesh Divekar <siddhesh...@gmail.com> wrote:
Right, then it looks less likely that we are blocked on a mutex.

Every 2 minutes we spin up a goroutine which in turn spins up a bunch of goroutines to collect data from BigQuery & Elastic (the data collector routines).
The 2-minute goroutine then waits on a waitgroup for the data collector routines to come back. So it's not a single goroutine from our side, at least.
From the backtrace we have from the SIGABRT, we see only one such data collector goroutine blocked, and the other two 2-minute goroutines waiting on their waitgroups.

Are you spinning up a goroutine every 2 minutes regardless of whether the previous instance had completed? Your comment is not clear about whether you have one, or more than one, goroutine issuing a BigQuery query at the time you saw the problem and captured the backtraces. If you are, in fact, starting those goroutines without regard to whether the prior instance had completed, that seems like a "yellow flag", "here be dragons" situation.

--
Kurtis Rader
Caretaker of the exceptional canines Junior and Hank

Siddhesh Divekar

Aug 27, 2020, 1:48:48 AM
to Kurtis Rader, Robert Engels, Ian Lance Taylor, golang-nuts
That's right Kurtis, we don't wait or check whether the prior goroutine which fired the BigQuery query has completed or not.
The rationale there was that the data which comes out of BigQuery is not much and should not take more than 2 minutes.

To be precise, every 2 minutes we fire 6 independent BigQuery queries & 24 independent Elastic queries, each in its own goroutine.
The data coming out is cached locally, and when the user requests that data we serve it out of the cache.

What we are unable to understand is why this would cause the http server to stop responding to user requests which are not even asking for this data.
Regular requests were also not getting served. Since the system was not responding, we had to send SIGABRT and collect the backtrace.


goroutine 3332366 [select, 79 minutes]:
net/http.(*http2ClientConn).roundTrip(0xc003759b00, 0xc00050b700, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/h2_bundle.go:7573 +0x9a0
net/http.(*http2Transport).RoundTripOpt(0xc0000a0480, 0xc00050b700, 0xc0037a8600, 0x74aba5, 0x0, 0xc0001c8000)
        /usr/local/go/src/net/http/h2_bundle.go:6936 +0x172
net/http.(*http2Transport).RoundTrip(...)
        /usr/local/go/src/net/http/h2_bundle.go:6897
net/http.http2noDialH2RoundTripper.RoundTrip(0xc0000a0480, 0xc00050b700, 0xc0047e97a0, 0x5, 0xc0001c8088)
        /usr/local/go/src/net/http/h2_bundle.go:9032 +0x3e
net/http.(*Transport).roundTrip(0x5b48c00, 0xc00050b700, 0x2b3ce20, 0x5b48801, 0xc0048deb00)
        /usr/local/go/src/net/http/transport.go:485 +0xdc8
net/http.(*Transport).RoundTrip(0x5b48c00, 0xc00050b700, 0x3301658, 0xa, 0xc0048debd0)
        /usr/local/go/src/net/http/roundtrip.go:17 +0x35
google.golang.org/api/transport/http.(*parameterTransport).RoundTrip(0xc0028d9e40, 0xc00050b600, 0x5, 0x5, 0x0)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/google.golang.org/a...@v0.19.0/transport/http/dial.go:145 +0x224
go.opencensus.io/plugin/ochttp.(*traceTransport).RoundTrip(0xc00284d6c0, 0xc00050b600, 0xc002080980, 0x1, 0x1)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/go.openc...@v0.22.3/plugin/ochttp/trace.go:84 +0x456
go.opencensus.io/plugin/ochttp.statsTransport.RoundTrip(0x38d4e60, 0xc00284d6c0, 0xc00050b400, 0xc0025e27e0, 0xc0025e27e0, 0x2ecb720)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/go.openc...@v0.22.3/plugin/ochttp/client_stats.go:57 +0x5cc
go.opencensus.io/plugin/ochttp.(*Transport).RoundTrip(0xc003731d60, 0xc00050b400, 0x0, 0x0, 0x150)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/go.openc...@v0.22.3/plugin/ochttp/client.go:99 +0x20e
golang.org/x/oauth2.(*Transport).RoundTrip(0xc00376bc20, 0xc00050b300, 0x0, 0x0, 0x0)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/golang.org/x/oau...@v0.0.0-20200107190931-bf48bf16ab8d/transport.go:55 +0xf4
net/http.send(0xc00050b300, 0x38d5060, 0xc00376bc20, 0x0, 0x0, 0x0, 0xc0036a64c8, 0x1d6cede, 0x1, 0x0)
        /usr/local/go/src/net/http/client.go:250 +0x443
net/http.(*Client).send(0xc003afdb60, 0xc00050b300, 0x0, 0x0, 0x0, 0xc0036a64c8, 0x0, 0x1, 0x0)
        /usr/local/go/src/net/http/client.go:174 +0xfa
net/http.(*Client).do(0xc003afdb60, 0xc00050b300, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/client.go:641 +0x3ce
net/http.(*Client).Do(...)
        /usr/local/go/src/net/http/client.go:509
google.golang.org/api/internal/gensupport.send(0x393ede0, 0xc00346b8c0, 0xc003afdb60, 0xc00050b200, 0x5b7fee0, 0xc0037a9300, 0x1d6f876)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/google.golang.org/a...@v0.19.0/internal/gensupport/send.go:67 +0xef
google.golang.org/api/internal/gensupport.SendRequest(0x393ede0, 0xc00346b8c0, 0xc003afdb60, 0xc00050b200, 0xc0037a95e8, 0x84, 0x0)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/google.golang.org/a...@v0.19.0/internal/gensupport/send.go:52 +0x193
google.golang.org/api/bigquery/v2.(*JobsGetQueryResultsCall).doRequest(0xc0037a98f8, 0x32f6d9c, 0x4, 0x0, 0x0, 0x203001)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/google.golang.org/a...@v0.19.0/bigquery/v2/bigquery-gen.go:7768 +0x765
google.golang.org/api/bigquery/v2.(*JobsGetQueryResultsCall).Do(0xc0037a98f8, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/google.golang.org/a...@v0.19.0/bigquery/v2/bigquery-gen.go:7780 +0xa8
cloud.google.com/go/bigquery.(*Job).waitForQuery.func1(0x6d183a, 0xc0037a97c0, 0xc0037a97e0)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/cloud.google.com/go/bigq...@v1.4.0/job.go:320 +0x48
cloud.google.com/go/internal.retry(0x393ede0, 0xc00346b8c0, 0x3b9aca00, 0xdf8475800, 0x4000000000000000, 0x0, 0xc0037a98e0, 0x3433818, 0x0, 0x203000)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/cloud.google.com/g...@v0.53.0/internal/retry.go:38 +0x48
cloud.google.com/go/internal.Retry(...)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/cloud.google.com/g...@v0.53.0/internal/retry.go:31
cloud.google.com/go/bigquery.(*Job).waitForQuery(0xc004080660, 0x393ede0, 0xc00346b8c0, 0xc001b74ca0, 0x7, 0x393ed60, 0xc0000d8018, 0x2d62240, 0x5b7e118, 0x3242840, ...)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/cloud.google.com/go/bigquery@v1.4.0/job.go:319 +0x4df
cloud.google.com/go/bigquery.(*Job).read(0xc004080660, 0x393ede0, 0xc00346b8c0, 0xc0037a9b38, 0x3430808, 0x0, 0x0, 0x393ede0)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/cloud.google.com/go/bigquery@v1.4.0/job.go:293 +0xbd
cloud.google.com/go/bigquery.(*Job).Read(0xc004080660, 0x393ed60, 0xc0000d8018, 0x0, 0x0, 0x0)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/cloud.google.com/go/bigquery@v1.4.0/job.go:280 +0x14c
cloud.google.com/go/bigquery.(*Query).Read(0xc003615ca0, 0x393ed60, 0xc0000d8018, 0xc003743a00, 0x1, 0x1)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/cloud.google.com/go/bigquery@v1.4.0/query.go:333 +0x91
git.fusion.io/fusionio/fusion/utils.git/bigquery.(*BigQueryWrapper).GetTemplateSQL(0xc003afdbc0, 0x331199e, 0x11, 0xc003afda70, 0x393f0e0, 0xc00010e1c0, 0xc003afdbc0, 0x0, 0x0)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/git.fusion.io/fusionio/fusion/util...@v0.0.0-20200813012118-9675cdb4e965/bigquery/bigquery.go:105 +0x25d
git.fusion.io/fusionio/fusion/controller.git/stats.(*StatsDao).executeQuery(0xc000010418, 0xc0028c3000, 0x331199e, 0x11, 0x24, 0x0, 0x0, 0x0)
        /builds/fusionio/fusion/controller/stats/statsdao.go:401 +0xe3
git.fusion.io/fusionio/fusion/controller.git/stats.(*StatsDao).GetTeamAppExpScore(...)
        /builds/fusionio/fusion/controller/stats/statsdao.go:545
git.fusion.io/fusionio/fusion/controller.git/stats.(*InsMgr).GetTeamAppExpScore(0xc001b54d40, 0xc0028c3000, 0x0, 0x42fd7a, 0x0, 0x0)
        /builds/fusionio/fusion/controller/stats/ins_mgr.go:459 +0x4e
git.fusion.io/fusionio/fusion/controller.git/stats.(*InsMgr).runQuery(0xc001b54d40, 0x330e1b0, 0xf, 0xc003da9870, 0xc0024ef080, 0xc001c4fcf0)
        /builds/fusionio/fusion/controller/stats/ins_mgr.go:464 +0x7e
created by git.fusion.io/fusionio/fusion/controller.git/stats.(*InsMgr).runParallelQuery
        /builds/fusionio/fusion/controller/stats/ins_mgr.go:485 +0x1a4

goroutine 3332154 [select, 79 minutes]:
net/http.(*http2ClientConn).roundTrip(0xc003759b00, 0xc00050a500, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/h2_bundle.go:7573 +0x9a0
net/http.(*http2Transport).RoundTripOpt(0xc0000a0480, 0xc00050a500, 0xc002de4600, 0x74aba5, 0x0, 0xc0001c8000)
        /usr/local/go/src/net/http/h2_bundle.go:6936 +0x172
net/http.(*http2Transport).RoundTrip(...)
        /usr/local/go/src/net/http/h2_bundle.go:6897
net/http.http2noDialH2RoundTripper.RoundTrip(0xc0000a0480, 0xc00050a500, 0xc0047e95f0, 0x5, 0xc0001c8088)
        /usr/local/go/src/net/http/h2_bundle.go:9032 +0x3e
net/http.(*Transport).roundTrip(0x5b48c00, 0xc00050a500, 0x2b3ce20, 0x5b48801, 0xc00170db80)
        /usr/local/go/src/net/http/transport.go:485 +0xdc8
net/http.(*Transport).RoundTrip(0x5b48c00, 0xc00050a500, 0x3301658, 0xa, 0xc00170dc20)
        /usr/local/go/src/net/http/roundtrip.go:17 +0x35
google.golang.org/api/transport/http.(*parameterTransport).RoundTrip(0xc0020a7e00, 0xc00050a400, 0x5, 0x5, 0x0)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/google.golang.org/a...@v0.19.0/transport/http/dial.go:145 +0x224
go.opencensus.io/plugin/ochttp.(*traceTransport).RoundTrip(0xc00284c540, 0xc00050a400, 0xc000eb78a0, 0x1, 0x1)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/go.openc...@v0.22.3/plugin/ochttp/trace.go:84 +0x456
go.opencensus.io/plugin/ochttp.statsTransport.RoundTrip(0x38d4e60, 0xc00284c540, 0xc00050a200, 0xc00420b910, 0xc00420b910, 0x2ecb720)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/go.openc...@v0.22.3/plugin/ochttp/client_stats.go:57 +0x5cc
go.opencensus.io/plugin/ochttp.(*Transport).RoundTrip(0xc00400d360, 0xc00050a200, 0x0, 0x0, 0x150)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/go.openc...@v0.22.3/plugin/ochttp/client.go:99 +0x20e
golang.org/x/oauth2.(*Transport).RoundTrip(0xc0022aaee0, 0xc00050a100, 0x0, 0x0, 0x0)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/golang.org/x/oau...@v0.0.0-20200107190931-bf48bf16ab8d/transport.go:55 +0xf4
net/http.send(0xc00050a100, 0x38d5060, 0xc0022aaee0, 0x0, 0x0, 0x0, 0xc0036a6310, 0x1d6cede, 0x1, 0x0)
        /usr/local/go/src/net/http/client.go:250 +0x443
net/http.(*Client).send(0xc003b8e540, 0xc00050a100, 0x0, 0x0, 0x0, 0xc0036a6310, 0x0, 0x1, 0x0)
        /usr/local/go/src/net/http/client.go:174 +0xfa
net/http.(*Client).do(0xc003b8e540, 0xc00050a100, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/client.go:641 +0x3ce
net/http.(*Client).Do(...)
        /usr/local/go/src/net/http/client.go:509
google.golang.org/api/internal/gensupport.send(0x393ede0, 0xc0031bb710, 0xc003b8e540, 0xc00050a000, 0x5b7fee0, 0xc002de5300, 0x1d6f876)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/google.golang.org/a...@v0.19.0/internal/gensupport/send.go:67 +0xef
google.golang.org/api/internal/gensupport.SendRequest(0x393ede0, 0xc0031bb710, 0xc003b8e540, 0xc00050a000, 0xc002de55e8, 0x84, 0x0)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/google.golang.org/a...@v0.19.0/internal/gensupport/send.go:52 +0x193
google.golang.org/api/bigquery/v2.(*JobsGetQueryResultsCall).doRequest(0xc002de58f8, 0x32f6d9c, 0x4, 0x0, 0x0, 0x203000)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/google.golang.org/a...@v0.19.0/bigquery/v2/bigquery-gen.go:7768 +0x765
google.golang.org/api/bigquery/v2.(*JobsGetQueryResultsCall).Do(0xc002de58f8, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/google.golang.org/a...@v0.19.0/bigquery/v2/bigquery-gen.go:7780 +0xa8
cloud.google.com/go/bigquery.(*Job).waitForQuery.func1(0x6d183a, 0xc002de57c0, 0xc002de57e0)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/cloud.google.com/go/bigquery@v1.4.0/job.go:320 +0x48
cloud.google.com/go/internal.retry(0x393ede0, 0xc0031bb710, 0x3b9aca00, 0xdf8475800, 0x4000000000000000, 0x0, 0xc002de58e0, 0x3433818, 0x0, 0x203000)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/cloud.google.com/g...@v0.53.0/internal/retry.go:38 +0x48
cloud.google.com/go/internal.Retry(...)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/cloud.google.com/g...@v0.53.0/internal/retry.go:31
cloud.google.com/go/bigquery.(*Job).waitForQuery(0xc003491980, 0x393ede0, 0xc0031bb710, 0xc00233d700, 0x7, 0x393ed60, 0xc0000d8018, 0x2d62240, 0x5b7e118, 0x3242840, ...)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/cloud.google.com/go/bigq...@v1.4.0/job.go:319 +0x4df
cloud.google.com/go/bigquery.(*Job).read(0xc003491980, 0x393ede0, 0xc0031bb710, 0xc002de5b38, 0x3430808, 0x0, 0x0, 0x393ede0)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/cloud.google.com/go/bigq...@v1.4.0/job.go:293 +0xbd
cloud.google.com/go/bigquery.(*Job).Read(0xc003491980, 0x393ed60, 0xc0000d8018, 0x0, 0x0, 0x0)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/cloud.google.com/go/bigq...@v1.4.0/job.go:280 +0x14c
cloud.google.com/go/bigquery.(*Query).Read(0xc003c79ca0, 0x393ed60, 0xc0000d8018, 0xc001e36200, 0x1, 0x1)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/cloud.google.com/go/bigq...@v1.4.0/query.go:333 +0x91
git.fusion.io/fusionio/fusion/utils.git/bigquery.(*BigQueryWrapper).GetTemplateSQL(0xc003b8e5a0, 0x331199e, 0x11, 0xc003b8e450, 0x393f0e0, 0xc00010e1c0, 0xc003b8e5a0, 0x0, 0x0)
        /builds/fusionio/fusion/controller/gopath/pkg/mod/git.fusion.io/fusionio/fusion/util...@v0.0.0-20200813012118-9675cdb4e965/bigquery/bigquery.go:105 +0x25d
git.fusion.io/fusionio/fusion/controller.git/stats.(*StatsDao).executeQuery(0xc000010418, 0xc0020cb600, 0x331199e, 0x11, 0xc000eb0000, 0xbd, 0x400, 0x88)
        /builds/fusionio/fusion/controller/stats/statsdao.go:401 +0xe3
git.fusion.io/fusionio/fusion/controller.git/stats.(*StatsDao).GetTeamAppExpScore(...)
        /builds/fusionio/fusion/controller/stats/statsdao.go:545
git.fusion.io/fusionio/fusion/controller.git/stats.(*InsMgr).GetTeamAppExpScore(0xc001b54d40, 0xc0020cb600, 0x44ea57, 0x42fd7a, 0xc001470558, 0x101)
        /builds/fusionio/fusion/controller/stats/ins_mgr.go:459 +0x4e
git.fusion.io/fusionio/fusion/controller.git/stats.(*InsMgr).runQuery(0xc001b54d40, 0x330e1b0, 0xf, 0xc0030d2a80, 0xc0035995c0, 0xc0015ede40)
        /builds/fusionio/fusion/controller/stats/ins_mgr.go:464 +0x7e
created by git.fusion.io/fusionio/fusion/controller.git/stats.(*InsMgr).runParallelQuery
        /builds/fusionio/fusion/controller/stats/ins_mgr.go:485 +0x1a4


These correspond to the goroutines waiting on the wait groups shown earlier.
--
-Siddhesh.

Robert Engels

Aug 27, 2020, 7:03:41 AM
to Kurtis Rader, Siddhesh Divekar, Ian Lance Taylor, golang-nuts
Ok, so your data collectors never complete. A simple change to make this easier to diagnose is to not spin up another collector controller at the 2 min mark if the previous has not completed. 

I would determine if the stuck collector is BQ or Elastic and check the server side logs. 
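
Something like this would do it - a sketch where collectOnce stands in for your real BigQuery/Elastic pass; a one-slot channel is used so a tick is simply skipped while the previous pass is still running:

package main

import (
	"log"
	"time"
)

// collectOnce stands in for the real BigQuery/Elastic collection pass.
func collectOnce() {
	time.Sleep(5 * time.Minute) // simulate a slow pass
}

func main() {
	running := make(chan struct{}, 1) // capacity 1: at most one collector in flight
	ticker := time.NewTicker(2 * time.Minute)
	defer ticker.Stop()

	for range ticker.C {
		select {
		case running <- struct{}{}: // slot free: start a collection pass
			go func() {
				defer func() { <-running }() // release the slot when done
				collectOnce()
			}()
		default:
			log.Println("previous collection still running; skipping this tick")
		}
	}
}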

On Aug 26, 2020, at 11:29 PM, Kurtis Rader <kra...@skepticism.us> wrote:



Robert Engels

Aug 27, 2020, 7:06:13 AM
to Kurtis Rader, Siddhesh Divekar, Ian Lance Taylor, golang-nuts
Could be a bug in the http/2 implementation. I would disable http/2 and see if you still encounter the problem.

On Aug 27, 2020, at 6:02 AM, Robert Engels <ren...@ix.netcom.com> wrote:



Siddhesh Divekar

Aug 27, 2020, 11:10:19 AM
to Robert Engels, Kurtis Rader, Ian Lance Taylor, golang-nuts
Ok, so your data collectors never complete. A simple change to make this easier to diagnose is to not spin up another collector controller at the 2 min mark if the previous has not completed.
We start a data collector (DC) every 2 minutes irrespective of whether the previous DC has completed or not, so we would have a few DCs stuck, but not all (2 from the backtrace).

I would determine if the stuck collector is BQ or Elastic and check the server side logs.
In the server side logs we don't see any activity at all. That is the confusing part: why are regular user requests not reaching the Go server at all?
We will check the BQ and Elastic logs around that time. But after a restart everything was working fine.

Could be a bug in the http/2 implementation. I would disable http/2 and see if you still encounter the problem.
If we hit it again we will collect profiles with github.com/robaho/goanalyzer & see what is happening.
By the way, is there a compile time flag to disable http2 while building?

Thanks,
--
-Siddhesh.

Rahul Yadav

Mar 20, 2021, 3:19:53 PM
to golang-nuts
Any updates? I am also facing this issue; in my case the Golang server just gets stuck and the open FD count spikes.
-Rahul