Hi Ian,
Thanks for replying.
We have a Go server that handles user requests and collects data from various sources such as GCP's Cloud SQL and BigQuery.
We are using Shopify's Sarama library for Kafka operations.
We are seeing lots of goroutines stuck in a waiting state for several minutes.
Over time, around 587 goroutines have been spun up.
We see that two goroutines are stuck on GCP BigQuery calls; we use WaitGroups there.
However, it's not clear why that would cause all the other goroutines to hang and drive the CPU usage up.
goroutine 3332131 [semacquire, 79 minutes]:
sync.runtime_Semacquire(0xc001c4fcf8)
/usr/local/go/src/runtime/sema.go:56 +0x42
sync.(*WaitGroup).Wait(0xc001c4fcf0)
/usr/local/go/src/sync/waitgroup.go:130 +0x64
git.fusion.io/fusionio/fusion/controller.git/stats.(*InsMgr).runParallelQuery(0xc001b54d40, 0xc002912c00, 0x330e1b0, 0xf, 0xc002912cf0, 0x3)
/builds/fusionio/fusion/controller/stats/ins_mgr.go:488 +0x1d7
git.fusion.io/fusionio/fusion/controller.git/stats.(*InsMgr).GetMainUi(0xc001b54d40, 0xc002912db8, 0xc001870e68, 0x746121, 0xc0010fcaf8, 0x17)
/builds/fusionio/fusion/controller/stats/ins_mgr.go:567 +0xa0d
git.fusion.io/fusionio/fusion/controller.git/stats.(*Prefetcher).fetchMainUiTeamInterval(0xc001b56780, 0xc002356810, 0x24, 0x32f7b78, 0x5)
/builds/fusionio/fusion/controller/stats/prefetcher.go:77 +0xf2
created by git.fusion.io/fusionio/fusion/controller.git/stats.(*Prefetcher).prefetchStats
	/builds/fusionio/fusion/controller/stats/prefetcher.go:100 +0xd8
goroutine 3332149 [semacquire, 79 minutes]:
sync.runtime_Semacquire(0xc0015ede48)
/usr/local/go/src/runtime/sema.go:56 +0x42
sync.(*WaitGroup).Wait(0xc0015ede40)
/usr/local/go/src/sync/waitgroup.go:130 +0x64
git.fusion.io/fusionio/fusion/controller.git/stats.(*InsMgr).runParallelQuery(0xc001b54d40, 0xc00249dc00, 0x330e1b0, 0xf, 0xc00249dcf0, 0x3)
/builds/fusionio/fusion/controller/stats/ins_mgr.go:488 +0x1d7
git.fusion.io/fusionio/fusion/controller.git/stats.(*InsMgr).GetMainUi(0xc001b54d40, 0xc00249ddb8, 0xc003200668, 0xc00407a520, 0xc003200590, 0x46ee97)
/builds/fusionio/fusion/controller/stats/ins_mgr.go:567 +0xa0d
git.fusion.io/fusionio/fusion/controller.git/stats.(*Prefetcher).fetchMainUiTeamInterval(0xc001b56780, 0xc002356ba0, 0x24, 0x32f7b78, 0x5)
/builds/fusionio/fusion/controller/stats/prefetcher.go:77 +0xf2
created by git.fusion.io/fusionio/fusion/controller.git/stats.(*Prefetcher).prefetchStats
	/builds/fusionio/fusion/controller/stats/prefetcher.go:100 +0xd8
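To illustrate the pattern: runParallelQuery fans each sub-query out to its own goroutine and then blocks on WaitGroup.Wait, roughly like the sketch below (hypothetical names, not our exact code). If any one sub-query never returns, the parent goroutine stays parked in semacquire exactly as in the stacks above.

// Minimal sketch of the fan-out pattern (hypothetical names, not the real code).
// If any query blocks forever (e.g. a hung HTTP call), wg.Wait() never returns
// and the calling goroutine sits in "semacquire" indefinitely.
package stats

import "sync"

func runParallelQuery(queries []func() error) []error {
	var wg sync.WaitGroup
	errs := make([]error, len(queries))

	for i, q := range queries {
		wg.Add(1)
		go func(i int, q func() error) {
			defer wg.Done()
			errs[i] = q() // no timeout: a stuck query blocks the WaitGroup forever
		}(i, q)
	}

	wg.Wait() // parked here in the stacks above
	return errs
}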
I found the link below, which seems to correlate with our scenario.
Most of the goroutines in the backtrace are in the net/http package, so our suspicion is that the bug above in our code might be causing this.
Even the BigQuery calls are hanging in net/http.
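One mitigation we are considering is bounding each BigQuery call with a context deadline, so that a hung net/http request cannot wedge the WaitGroup. A rough sketch only (it assumes the cloud.google.com/go/bigquery client; the function name and the 2-minute deadline are made up):

// Sketch: bound a BigQuery call with a context timeout so a hung
// net/http connection cannot block wg.Wait() indefinitely.
package stats

import (
	"context"
	"time"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/iterator"
)

func runBoundedQuery(ctx context.Context, client *bigquery.Client, sql string) ([][]bigquery.Value, error) {
	// Cancel the query (and its underlying HTTP request) after 2 minutes.
	ctx, cancel := context.WithTimeout(ctx, 2*time.Minute)
	defer cancel()

	it, err := client.Query(sql).Read(ctx)
	if err != nil {
		return nil, err
	}

	var rows [][]bigquery.Value
	for {
		var row []bigquery.Value
		err := it.Next(&row)
		if err == iterator.Done {
			break
		}
		if err != nil {
			return nil, err // includes ctx.Err() if the deadline expired mid-read
		}
		rows = append(rows, row)
	}
	return rows, nil
}

With something like that in place, a stuck HTTP connection would surface as a context deadline error instead of an indefinitely parked goroutine, but we'd still like to understand why the hang spreads to the other goroutines.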
We are using Go 1.13.8 and are running on a GCP Kubernetes cluster with Ubuntu 18.04 Docker containers.
go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/root/go"
GOPRIVATE=""
GOPROXY="
https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="
sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/builds/prosimoio/prosimo/pdash/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build048009048=/tmp/go-build -gno-record-gcc-switches"
Let me know if any other information is needed.