All goroutines hung in futex / sleep

651 views
Skip to first unread message

Michael Andersen

unread,
Feb 22, 2018, 9:25:18 PM2/22/18
to golang-nuts
Hi all

I have a complex program that when under load will very reproducibly freeze every goroutine simultaneously. It then makes no progress at all, even if left for hours. I'm posting here because I don't know of anything that can cause this behavior so I don't even know where to begin debugging. When it "freezes", every goroutine appears to be no longer scheduled, no matter how simple. Even this at the start of main() ceases to print to stdout:

go func() {
for {
time.Sleep(3 * time.Second)
fmt.Printf("Still alive\n")
}
}()

The system is nowhere near OOM, the goroutine count is large but reasonable just before the freeze (<2k). After it freezes the process is still running, and attaching sysdig shows it is stuck spinning in futex, with only this showing up over and over:

637779 17:21:56.254826712 20 prog (43085) < futex res=-110(ETIMEDOUT) 
637782 17:21:56.254827305 20 prog (43085) > futex addr=10D5FA0 op=0(FUTEX_WAIT) val=0 
637783 17:21:56.254828132 20 prog (43085) > switch next=0 pgft_maj=0 pgft_min=60361 vm_size=20710168 vm_rss=10792276 vm_swap=0 

The "frozen" program still responds to SIGQUIT and dumps out the goroutines, but given that this is not a minimal reproducer (which I have not managed to make) I don't know which parts of that are useful. I put all of it here: https://gist.github.com/immesys/0b741e4ea18979614d8419fa9c007098

My main question is what sort of bugs can cause the whole program to lock up? Even if some goroutines were deadlocked, why would that stop everything from net/http/pprof to a printf loop from working?

Some tidbits:

I have a core dump so I can inspect things with delve if I know what I am looking for
Building/running with -race doesn't print anything
I came across this (https://groups.google.com/forum/#!msg/golang-nuts/PMm8nH0yaoA/mb-cnKmZlb4J) which describes a similar occurency but I don't interact with syslog, at least not directly.
I am getting this on go 1.10 but I rebuilt on 1.9.4 and I get the same behavior.
I am on linux amd64 kernel 4.10
It only takes about two minutes to reproduce.
When frozen, only a single CPU core is pegged, the rest of the system is fine.

Any help at all would be appreciated, thanks
Michael



Ian Lance Taylor

unread,
Feb 22, 2018, 11:12:30 PM2/22/18
to Michael Andersen, golang-nuts
I don't know what is happening with your program.

This kind of thing can happen if you have a goroutine that is running
in a tight loop with no function calls or memory allocations. The
current Go scheduler is non-preemptive, meaning that nothing will stop
that loop. If that loop occurs while holding a lock, it could block
the entire rest of the program from running. However, you would see
this in the stack trace. This problem with the current scheduler is
https://golang.org/issue/10958.

This kind of thing can happen if you are using an in-process FUSE file
system implemented by goroutines in your program. The Go runtime
believes that some system calls, such as pipe or socket, never block.
If you have somehow set things up so that those system calls enter
your FUSE file system and depend on some other goroutine running, it
is possible that that goroutine will never be scheduled. I don't know
of any bugs like this at present but they have existed in the past.
Of course if you aren't using a FUSE file system then this is not the
problem.

This kind of thing can happen if you use assembler code to do a
blocking operation, or if you use syscall.Rawsyscall to call a system
call that blocks. That can confuse the scheduler and lead to a
deadlock. Don't do that.

None of these are likely, but you asked for suggestions, and that's
what I've come up with.

Ian

Michael Andersen

unread,
Feb 22, 2018, 11:48:25 PM2/22/18
to Ian Lance Taylor, golang-nuts
Hi

Thanks for your suggestions. This freeze happens to coincide with a larger number of cgo calls which in turn most likely do blocking read/write from sockets. How are those treated by the scheduler?

I am not doing anything FUSE related, nor do I have assembly code. The only "interesting" stuff is the use of a c library - librados.

In the goroutine dump from SIGQUIT, some stacks are followed by register dumps. Are those the only ones currently running? All of those are in runtime: 16 are in runtime.futex, 2 are in runtime.notetsleepg and one in runtime.gopark, which would make it seem like it was not an un-preemptable tight loop, but I am not sure how to parse the SIGQUIT output.

Thanks
Michael

Michael Andersen

unread,
Feb 23, 2018, 2:10:56 AM2/23/18
to Ian Lance Taylor, golang-nuts
I have found that GODEBUG=schedtrace still prints when the program is frozen but I cannot find much documentation on deciphering its output. The full output is here: https://gist.github.com/immesys/7f213c63c54ce60ba72d1cbc9ffcc4cb but here are the first few lines. It looks like this is garbage collector related?

SCHED 111364ms: gomaxprocs=40 idleprocs=0 threads=78 spinningthreads=0 idlethreads=65 runqueue=31 gcwaiting=1 nmidlelocked=1 stopwait=1 sysmonwait=0
P0: status=3 schedtick=36699 syscalltick=29158 m=-1 runqsize=0 gfreecnt=45
P1: status=3 schedtick=37370 syscalltick=30563 m=-1 runqsize=0 gfreecnt=31
P2: status=3 schedtick=36650 syscalltick=29478 m=-1 runqsize=0 gfreecnt=17
P3: status=3 schedtick=42395 syscalltick=34028 m=-1 runqsize=0 gfreecnt=38
P4: status=3 schedtick=40531 syscalltick=33546 m=-1 runqsize=0 gfreecnt=27
P5: status=3 schedtick=41369 syscalltick=31884 m=-1 runqsize=0 gfreecnt=19
P6: status=3 schedtick=39302 syscalltick=32162 m=-1 runqsize=0 gfreecnt=39
P7: status=3 schedtick=39739 syscalltick=32624 m=-1 runqsize=0 gfreecnt=34
P8: status=3 schedtick=32585 syscalltick=24358 m=-1 runqsize=0 gfreecnt=45
P9: status=3 schedtick=40376 syscalltick=30399 m=-1 runqsize=0 gfreecnt=27
P10: status=3 schedtick=31018 syscalltick=22164 m=-1 runqsize=0 gfreecnt=40
P11: status=3 schedtick=32043 syscalltick=24479 m=-1 runqsize=0 gfreecnt=53
P12: status=3 schedtick=36459 syscalltick=25222 m=-1 runqsize=0 gfreecnt=21
P13: status=3 schedtick=38445 syscalltick=28317 m=-1 runqsize=0 gfreecnt=38
P14: status=3 schedtick=39533 syscalltick=28189 m=-1 runqsize=0 gfreecnt=29
P15: status=3 schedtick=37009 syscalltick=27231 m=-1 runqsize=0 gfreecnt=29
P16: status=3 schedtick=36957 syscalltick=26211 m=44 runqsize=0 gfreecnt=45
P17: status=3 schedtick=35632 syscalltick=26481 m=-1 runqsize=0 gfreecnt=48
P18: status=3 schedtick=36228 syscalltick=24751 m=-1 runqsize=0 gfreecnt=59
P19: status=3 schedtick=32479 syscalltick=22970 m=-1 runqsize=0 gfreecnt=48
P20: status=3 schedtick=34090 syscalltick=25864 m=-1 runqsize=0 gfreecnt=12
P21: status=3 schedtick=35409 syscalltick=24369 m=-1 runqsize=0 gfreecnt=6
P22: status=3 schedtick=42235 syscalltick=28607 m=-1 runqsize=0 gfreecnt=16
P23: status=3 schedtick=29733 syscalltick=19429 m=-1 runqsize=0 gfreecnt=43
P24: status=3 schedtick=34757 syscalltick=22141 m=-1 runqsize=0 gfreecnt=26
P25: status=3 schedtick=31434 syscalltick=21767 m=-1 runqsize=0 gfreecnt=23
P26: status=3 schedtick=31597 syscalltick=21657 m=-1 runqsize=0 gfreecnt=53
P27: status=3 schedtick=32539 syscalltick=22653 m=-1 runqsize=0 gfreecnt=34
P28: status=3 schedtick=29503 syscalltick=20588 m=-1 runqsize=0 gfreecnt=28
P29: status=3 schedtick=30842 syscalltick=20199 m=-1 runqsize=0 gfreecnt=41
P30: status=3 schedtick=29787 syscalltick=19492 m=-1 runqsize=0 gfreecnt=15
P31: status=3 schedtick=28226 syscalltick=19101 m=-1 runqsize=0 gfreecnt=32
P32: status=3 schedtick=34254 syscalltick=24135 m=-1 runqsize=0 gfreecnt=27
P33: status=3 schedtick=34567 syscalltick=22554 m=-1 runqsize=0 gfreecnt=36
P34: status=3 schedtick=30416 syscalltick=19722 m=-1 runqsize=0 gfreecnt=33
P35: status=1 schedtick=19536 syscalltick=13084 m=0 runqsize=0 gfreecnt=50
P36: status=3 schedtick=29336 syscalltick=19015 m=-1 runqsize=0 gfreecnt=28
P37: status=3 schedtick=29794 syscalltick=19499 m=-1 runqsize=0 gfreecnt=40
P38: status=3 schedtick=31748 syscalltick=20359 m=-1 runqsize=0 gfreecnt=35
P39: status=3 schedtick=31851 syscalltick=20260 m=-1 runqsize=0 gfreecnt=46


Michael Andersen

unread,
Feb 23, 2018, 4:20:58 AM2/23/18
to Ian Lance Taylor, golang-nuts
Ok I found the problem. A tight loop was indeed preventing the garbage collector from proceeding, and then the GC was stopping everything else. The loop contained atomic.LoadUint64 and atomic.CompareAndSwapUint64, but neither of those seem to qualify as preemption points. Fixing that loop fixes everything.

Thanks to Ian for pointing me on the right path and thanks to Florin Pățan on gophers.slack.com for pointing me at Delve which eventually let me find the problematic code.

I know this is not at all the job of the race detector, but given that compiling with -race is the first thing we do when a program experiences "deadlock" type behavior, and I have seen several people being thrown off by the way go behaves when there are tight loops, can we not detect that the GC has stalled for way too long and print something out? Like "Goroutine #545 is preventing GC" or something? Just a thought.

Thanks
Michael

Ian Lance Taylor

unread,
Feb 23, 2018, 10:43:03 AM2/23/18
to Michael Andersen, golang-nuts
On Fri, Feb 23, 2018 at 1:19 AM, Michael Andersen <mic...@steelcode.com> wrote:
>
> Ok I found the problem. A tight loop was indeed preventing the garbage
> collector from proceeding, and then the GC was stopping everything else. The
> loop contained atomic.LoadUint64 and atomic.CompareAndSwapUint64, but
> neither of those seem to qualify as preemption points. Fixing that loop
> fixes everything.
>
> Thanks to Ian for pointing me on the right path and thanks to Florin Pățan
> on gophers.slack.com for pointing me at Delve which eventually let me find
> the problematic code.
>
> I know this is not at all the job of the race detector, but given that
> compiling with -race is the first thing we do when a program experiences
> "deadlock" type behavior, and I have seen several people being thrown off by
> the way go behaves when there are tight loops, can we not detect that the GC
> has stalled for way too long and print something out? Like "Goroutine #545
> is preventing GC" or something? Just a thought.

That seems like a good suggestion. Want to open an issue for it at
https://golang.org/issue? Thanks.

There is a pretty good chance that the general preemptible loop
problem will be fixed in 1.11.

Ian
Reply all
Reply to author
Forward
0 new messages