On Sat, Mar 5, 2016 at 10:10 AM, <carl.mas...@gmail.com> wrote:
>
> The problem that I am seeing is that sometimes the Stat syscalls take longer
> than expected, and tie up whole OS threads. This results in more threads
> being created in the scheduler, until at last pthread_create fails and the
> entire program crashes. Looking at the dump of goroutines shows around
> 6000 goroutines, with about 280 of them waiting on the Stat syscall. I
> don't believe this is a recoverable error, since the failure appears to
> actually be coming from runtime/cgo rather than the goroutine trying to stat.
See https://golang.org/issue/7903 for some discussion on this general
issue.
> Inspecting the scheduler source code, I looked for where sched.mcount is
> ever decremented and didn't find any place. I also don't see anywhere
> leading to how to limit the number of threads in the system aside from
> runtime/debug.SetMaxThreads, which is fatal.
Both correct.
> 1. How do I limit the number of threads created? Working with Goroutines
> has been such a pleasure; it would be a shame if I had to contort my program
> to be aware of threads in order to not crash.
As discussed on issue 7903, we have not found a good general solution
for this problem. It's very hard for the Go runtime to know when it
is OK to delay an operation waiting for another operation to complete.
So, yes, at present, you unfortunately need to contort your program.
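One common form that contortion takes is a counting semaphore around the blocking call, so that excess goroutines wait on a channel (which does not tie up a thread) instead of inside the system call. The following is only a sketch of that idea; the statLimiter type and the cap of 8 are made up for illustration and are not anything the runtime provides.

```go
package main

import (
	"fmt"
	"os"
	"sync"
)

// statLimiter (hypothetical helper) bounds how many goroutines may be
// inside os.Stat at once, which in turn bounds how many OS threads
// those calls can tie up.
type statLimiter struct {
	slots chan struct{}
}

func newStatLimiter(max int) *statLimiter {
	return &statLimiter{slots: make(chan struct{}, max)}
}

// Stat waits on a channel until a slot is free, then performs the
// real system call and releases the slot when it returns.
func (l *statLimiter) Stat(path string) (os.FileInfo, error) {
	l.slots <- struct{}{}
	defer func() { <-l.slots }()
	return os.Stat(path)
}

func main() {
	limiter := newStatLimiter(8) // 8 is an arbitrary illustrative cap
	paths := []string{".", os.TempDir(), "/does/not/exist"}

	var wg sync.WaitGroup
	for _, p := range paths {
		wg.Add(1)
		go func(p string) {
			defer wg.Done()
			if _, err := limiter.Stat(p); err != nil {
				fmt.Println("stat failed:", err)
				return
			}
			fmt.Println("stat ok:", p)
		}(p)
	}
	wg.Wait()
}
```

The right cap depends on the workload and the system; the point is only that the limit lives in the program rather than in the runtime.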
> 2. How are threads ever released back to the operating system? Are they
> going to stick around forever based on the highest spike of syscalls in the
> history of the process? I noticed that even in times of low load the thread
> count of my program hovered around 200 (as per /proc/pid/status)
At present threads are never released back to the operating system.
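If you want to watch this from inside the program rather than through /proc, the standard "threadcreate" profile in runtime/pprof reports how many threads the runtime has created. A minimal sketch (the threadsCreated helper name is just for illustration):

```go
package main

import (
	"fmt"
	"runtime/pprof"
)

// threadsCreated (hypothetical helper) reports how many OS threads the
// runtime has created so far, using the standard "threadcreate" profile.
func threadsCreated() int {
	return pprof.Lookup("threadcreate").Count()
}

func main() {
	fmt.Println("threads created so far:", threadsCreated())
}
```

Logging this value periodically should show roughly when the spikes in blocking system calls happen.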
> 3. Can thread creation failure be a recoverable error? In my case, it
> would be much better if thread creation failure was recoverable rather than
> fatal. Even hanging until a new thread could be created or reused would be
> better, since it wouldn't abruptly leave all my open files and network
> connections. Being able to profile a slow program that is caused by thread
> starvation would be a much better stability story than aborting.
New threads are created independently of any goroutine context. It's
not obvious how failure to create a thread could be reported to the
program, or what the program could do to recover.
Hanging until a new thread can be created will solve some problems but
create other ones: some kinds of programs would silently deadlock.
There is some discussion of this at https://golang.org/issue/4056.
> I also have some tangential questions that came up in my debugging:
>
> 4. The failure I see seems to come from cgo:
>
> runtime/cgo: pthread_create failed: Resource temporarily unavailable
> SIGABRT: abort
> PC=0xb6e94f96 m=2
>
>
> I don't really have anything special going on here, it's a plain ol' Go
> program with no C involved by my actions (and I only use the standard
> library). Why is the cgo mentioned in the output when crashing?
By default, if you did not build with CGO_ENABLED=0, any program that
imports the net or os/user packages is a cgo program. In a cgo
program, every new thread is created by the runtime/cgo library. This
error message is clearly somewhat misleading and probably should be
changed.
> 5. What is the difference between runnable and syscall in the goroutine
> traceback output? Both seem to be possible while hanging on a syscall.
> Example:
>
> goroutine 7464 [syscall]:
> syscall.Syscall(0xc3, 0x14634ba0, 0x14573854, 0x0, 0x0, 0x4, 0x149d04)
> /home/carl/.golive/src/syscall/asm_linux_arm.s:17 +0x8
> syscall.Stat(0x14634b70, 0x25, 0x14573854, 0x0, 0x0)
> /home/carl/.golive/src/syscall/zsyscall_linux_arm.go:1613 +0x8c
>
>
> and
>
> goroutine 7192 [runnable]:
> syscall.Syscall(0xc3, 0x13c1fa70, 0x13bd38e4, 0x0, 0xffffffff, 0x0, 0x2)
> /home/carl/.golive/src/syscall/asm_linux_arm.s:17 +0x8
> syscall.Stat(0x13c1fa40, 0x25, 0x13bd38e4, 0x0, 0x0)
> /home/carl/.golive/src/syscall/zsyscall_linux_arm.go:1613 +0x8c
>
>
> Almost all of the hung Stat calls are in runnable, with only a tiny amount
> in syscall.
The runtime will only run a limited number of goroutines
simultaneously, as controlled by GOMAXPROCS. In the absence of other
information, my guess would be that you had many goroutines make a
system call simultaneously. As each one entered syscall.Syscall, it
went into syscall state, and freed up another goroutine slot,
permitting another goroutine to enter syscall.Syscall. The goroutines
entered system calls faster than the calls completed. Then the calls
started completing. Each completed system call moved the goroutine
back to runnable state, but now the burst happened on the other side:
the system calls completed more quickly than the scheduler was able to
handle them. The result is a bunch of goroutines that have completed
the system call and are in runnable state waiting for a scheduler slot
to actually continue running.
Just a guess, though.
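To check a guess like this, it can help to tally goroutines by state instead of reading the dump by eye. Here is a rough sketch that parses the "goroutine N [state]:" header lines produced by runtime.Stack; the goroutineStates helper and the 1 MiB buffer size are assumptions for illustration, and a very large dump would be truncated.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// goroutineStates (hypothetical helper) takes a dump of all goroutines
// and counts them by the state shown in each "goroutine N [state]:"
// header line.
func goroutineStates() map[string]int {
	buf := make([]byte, 1<<20) // the dump is truncated beyond 1 MiB
	n := runtime.Stack(buf, true)

	counts := make(map[string]int)
	for _, line := range strings.Split(string(buf[:n]), "\n") {
		if !strings.HasPrefix(line, "goroutine ") {
			continue
		}
		start := strings.Index(line, "[")
		end := strings.LastIndex(line, "]")
		if start < 0 || end < start {
			continue
		}
		state := line[start+1 : end]
		// Drop suffixes such as ", 2 minutes" after the state name.
		if i := strings.Index(state, ","); i >= 0 {
			state = state[:i]
		}
		counts[state]++
	}
	return counts
}

func main() {
	for state, n := range goroutineStates() {
		fmt.Printf("%-12s %d\n", state, n)
	}
}
```

Sampling this a few times during a burst should show whether the goroutines pile up in syscall first and then in runnable.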
> 6. How much memory does a thread typically take, or take by default in Go?
> My ulimit is pretty high so I am pretty sure I am not hitting it. The only
> other reason I can think of pthread_create failing is memory related.
Threads are created with the system default thread stack size.
> 7. There is a SIGABRT in my output. Is this caused by the Go runtime
> trying to end itself, or from some outside source? Would it make any sense
> to try and catch SIGABRT?
The SIGABRT is raised because the fatal error path for a failed
pthread_create calls abort. Catching that signal would not permit the
program to continue.
Ian