On Jun 17, 2021, at 11:19 AM, Peter Z <zjy19...@gmail.com> wrote:
On Thu, Jun 17, 2021 at 9:19 AM Peter Z wrote:
>
> The original post is on Stack Overflow: https://stackoverflow.com/questions/67999117/unexpected-stuck-in-sync-pool-get
>
> Golang ENV:
> go1.14.3 linux/amd64
>
> Description:
> We have about half a million agents running on each of our machines. The agent is written in Go. Recently we found that the agent may get stuck, with no response for the sent requests. The metrics exported from the agent show that a channel in the agent (caching the requests) is full. Digging into the goroutine stacks, we found that the goroutines consuming messages from the channel are all waiting for a lock. The goroutine stack details are shown below.
That is peculiar. What is happening under the lock is that the pool
is allocating a slice that is GOMAXPROCS in length. This shouldn't
take long, obviously. And it only needs to happen when the pool is
first created, or when GOMAXPROCS changes. So: how often do you
create this pool? Is it the case that you create the pool and then
have a large number of goroutines try to Get a value simultaneously?
Or, how often do you change GOMAXPROCS? (And, if you do change
GOMAXPROCS, why?)
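To make the question concrete, here is a minimal sketch (hypothetical code, not taken from your agent) contrasting a long-lived package-level pool with a freshly created pool that many goroutines hit at once:

package main

import (
	"bytes"
	"sync"
)

// Long-lived pool: its first Get (and a Get after GOMAXPROCS changes)
// takes the global mutex in pinSlow to allocate the per-P local array;
// after that, Get normally stays on the lock-free fast path.
var sharedPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

func main() {
	// The pattern I'm asking about: a brand-new pool whose very first
	// Gets are issued by thousands of goroutines simultaneously, so
	// they all funnel through the same global mutex in pinSlow.
	freshPool := &sync.Pool{
		New: func() interface{} { return new(bytes.Buffer) },
	}

	var wg sync.WaitGroup
	for i := 0; i < 7000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()

			b := freshPool.Get().(*bytes.Buffer) // may contend in pinSlow
			b.Reset()
			freshPool.Put(b)

			b = sharedPool.Get().(*bytes.Buffer) // fast path once pinned
			b.Reset()
			sharedPool.Put(b)
		}()
	}
	wg.Wait()
}

If the agent creates a new sync.Pool per request or per connection rather than reusing one, that alone could explain a lot of time spent waiting behind that mutex, since all pools in the process share it.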
> The stack shows that all of the goroutines are waiting for the global lock in sync.Pool. But I can't figure out which goroutine is holding the lock. There should be a goroutine which has `sync.runtime_SemacquireMutex` in its stack but not at the top, yet there isn't one.
I don't think that is what you would see. I think you would see a goroutine with pinSlow in the stack but with SemacquireMutex not in the stack.
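One way to check, assuming the agent exposes the usual net/http/pprof handlers (as your curl output suggests), is to pull the full per-goroutine stacks from the same endpoint with debug=2 rather than the merged debug=1 view, for example:

curl ******795/debug/pprof/goroutine?debug=2 2>/dev/null | grep -B20 pinSlow

and then look for a stack that contains pinSlow but is not parked in sync.runtime_SemacquireMutex; that would be the goroutine holding (or at least inside) the lock at the moment of the snapshot.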
[******@****** ~]$ curl ******795/debug/pprof/goroutine?debug=1 2>/dev/null | grep pinSlow -B4
166 @ 0x438cd0 0x4497e0 0x4497cb 0x449547 0x481c1c 0x482792 0x482793 0x4824ee 0x4821af 0x520ebd 0x51fdff 0x51fcd0 0x737fb4 0x73a836 0x73a813 0x97d660 0x97d60a 0x97d5e9 0x4689e1
# 0x449546 sync.runtime_SemacquireMutex+0x46 /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/runtime/sema.go:71
# 0x481c1b sync.(*Mutex).lockSlow+0xfb /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/sync/mutex.go:138
# 0x482791 sync.(*Mutex).Lock+0x271 /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/sync/mutex.go:81
# 0x482792 sync.(*Pool).pinSlow+0x272 /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/sync/pool.go:213
--
120 @ 0x438cd0 0x4497e0 0x4497cb 0x449547 0x481c1c 0x482792 0x482793 0x4824ee 0x4821af 0x4f646f 0x51ed7b 0x51ff39 0x5218e7 0x73a8b0 0x97d660 0x97d60a 0x97d5e9 0x4689e1
# 0x449546 sync.runtime_SemacquireMutex+0x46 /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/runtime/sema.go:71
# 0x481c1b sync.(*Mutex).lockSlow+0xfb /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/sync/mutex.go:138
# 0x482791 sync.(*Mutex).Lock+0x271 /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/sync/mutex.go:81
# 0x482792 sync.(*Pool).pinSlow+0x272 /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/sync/pool.go:213
--
119 @ 0x438cd0 0x4497e0 0x4497cb 0x449547 0x481c1c 0x482792 0x482793 0x4824ee 0x4821af 0x5269e1 0x5269d2 0x526892 0x51f6cd 0x51f116 0x51ff39 0x5218e7 0x73a8b0 0x97d660 0x97d60a 0x97d5e9 0x4689e1
# 0x449546 sync.runtime_SemacquireMutex+0x46 /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/runtime/sema.go:71
# 0x481c1b sync.(*Mutex).lockSlow+0xfb /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/sync/mutex.go:138
# 0x482791 sync.(*Mutex).Lock+0x271 /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/sync/mutex.go:81
# 0x482792 sync.(*Pool).pinSlow+0x272 /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/sync/pool.go:213
--
59 @ 0x438cd0 0x4497e0 0x4497cb 0x449547 0x481c1c 0x482792 0x482793 0x4824ee 0x4821af 0x4d8291 0x4d5726 0x9857f7 0x9804ca 0x97d5d5 0x4689e1
# 0x449546 sync.runtime_SemacquireMutex+0x46 /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/runtime/sema.go:71
# 0x481c1b sync.(*Mutex).lockSlow+0xfb /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/sync/mutex.go:138
# 0x482791 sync.(*Mutex).Lock+0x271 /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/sync/mutex.go:81
# 0x482792 sync.(*Pool).pinSlow+0x272 /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/sync/pool.go:213
--
36 @ 0x438cd0 0x4497e0 0x4497cb 0x449547 0x481c1c 0x482792 0x482793 0x4824ee 0x4821af 0x51ed98 0x51ed88 0x51ff39 0x5218e7 0x73a8b0 0x97d660 0x97d60a 0x97d5e9 0x4689e1
# 0x449546 sync.runtime_SemacquireMutex+0x46 /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/runtime/sema.go:71
# 0x481c1b sync.(*Mutex).lockSlow+0xfb /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/sync/mutex.go:138
# 0x482791 sync.(*Mutex).Lock+0x271 /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/sync/mutex.go:81
# 0x482792 sync.(*Pool).pinSlow+0x272 /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/sync/pool.go:213
--
10 @ 0x438cd0 0x4497e0 0x4497cb 0x449547 0x481c1c 0x482792 0x482793 0x4824ee 0x4821af 0x4d8291 0x4d8856 0x9761b6 0x4689e1
# 0x449546 sync.runtime_SemacquireMutex+0x46 /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/runtime/sema.go:71
# 0x481c1b sync.(*Mutex).lockSlow+0xfb /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/sync/mutex.go:138
# 0x482791 sync.(*Mutex).Lock+0x271 /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/sync/mutex.go:81
# 0x482792 sync.(*Pool).pinSlow+0x272 /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/sync/pool.go:213
--
2 @ 0x438cd0 0x4497e0 0x4497cb 0x449547 0x481c1c 0x482792 0x482793 0x4824ee 0x4821af 0x7c6ddc 0x7c6dc3 0x7c8cdf 0x4689e1
# 0x449546 sync.runtime_SemacquireMutex+0x46 /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/runtime/sema.go:71
# 0x481c1b sync.(*Mutex).lockSlow+0xfb /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/sync/mutex.go:138
# 0x482791 sync.(*Mutex).Lock+0x271 /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/sync/mutex.go:81
# 0x482792 sync.(*Pool).pinSlow+0x272 /home/ferry/ONLINE_SERVICE/other/ferry/task_workspace/gopath/src/******/go-env/go1-14-linux-amd64/src/sync/pool.go:213
> Reproduce:
> Can't find a way to reproduce this problem for now.
It's going to be pretty hard for us to solve the problem without a reproducer.
So, you have about 500,000 processes running this agent on each machine, and each process has around 7,000 goroutines? Is that correct?
On Jun 21, 2021, at 6:31 AM, Peter Z <zjy19...@gmail.com> wrote:
> So, you have about 500,000 processes running this agent on each machine, and each process has around 7,000 goroutines? Is that correct?
Yes, that's exactly what I mean.
On Jun 22, 2021, at 7:07 AM, jake...@gmail.com <jake...@gmail.com> wrote:
He is stating he has a cloud cluster consisting of 500k machines - each machine runs one agent process - each agent has 7,000 goroutines.
Sorry, now I am completely confused.
> So, you have about 500,000 processes running this agent on each machine, and each process has around 7,000 goroutines? Is that correct?
> Yes, that's exactly what I mean.
But then you say: "Only one process per machine". Is there a language barrier, or am I missing something?
On Jun 22, 2021, at 9:21 AM, Peter Z <zjy19...@gmail.com> wrote:
Sorry, a correction: I said 'hyperthread closed' earlier, but hyperthreading is actually on.