Dealing with a crash in production of: runtime/cgo: pthread_create failed: Resource temporarily unavailable with our new Proxy daemon.

ralph....@sendgrid.com

unread,

Jan 28, 2015, 1:18:05 PM1/28/15

to golan...@googlegroups.com

Hello Golang-Nuts group,

We recently started seeing a crash in production related to our long running Go daemon crashing with: pthread_create failed: Resource temporarily unavailable. We've been analyzing crash logs and have been working on trying to determine to the cause of the crash. I apologize in advance for the long winded email but I'm trying to be detailed enough so as to provide a little backstory of what we're experiencing.

Let me explain a little background of what this daemon does:

It's really no more than a TCP proxy that allows us to send a high volume of traffic out of our private network basically utilizing io.Copy to do the bulk of the data-transfer
It has a a few rather simple datastructures that maintain some stats (bookkeeping) of the concurrency utilization of a cluster of ips bound to the local box that it uses for local addresses

These datastructures are relatively small and have mutexes around them protecting their reads and writes
We see no contention so far in the use of these mutexes around these datastructures

It basically, binds to these local addresses and dials out to remote MX servers around the world
This process is long running running linux64bit on Centos 6.0 using the Go 1.3.3
We have it running in production with GOMAXPROCS=`nproc`
The process is largely IO bound, does not do any DNS lookups and relies on a third-party library for logging (which does some cgo calls for writing to Syslog).
Here is a summary of the Goroutines in their various states:

Total goroutines at the time of crash: 35949
88 [syscall]
24020 [IO wait]
1017 [runnable]
9508 [select]
0 [running]
15 [chan receive]
0 [chan send]
1 [finalizer wait]
2 [sleep]
1382 [semacquire]

On average we've seen the daemon use anywhere from 75 native threads to around 500 threads but in some cases it's spiking. Would this many threads be considered high? One thing I discovered was that our soft limit on max threads is only set to 1024. Of course we can bump it but my gut feeling is that our daemon is creating excessive threads possibly.
An additional observation: To our surprise we noticed quite many goroutines in both [IO wait] and [select] where they have been in those states for many, many minutes. These are connections that have been sitting around for some reason which we are also looking into fixing. Obviously this is a concern but even if we had a lot of goroutines sitting around waiting for IO or select, these wouldn't necessarily cause the Go runtime to spawn additional native threads right?

I would have attached the stacktrace to this email but it's rather large around 100MB. So here is a dropbox link for further analysis (if anybody is willing):

https://db.tt/7PhdjbhB

I would appreciate it if the group can give me any hints on where to look and hopefully provide some additional insight as to what could be going on. If it's simply a matter of bumping up our thread limit we can definitely do that as this crash happens a few times a week so far. Another thing that could possibly be screwing us over is that we are logging quite a bit in production as we continue to vet out this daemon (since it's new)

Any help is appreciated and please let me know if I can provide more details.

Thanks,

- Ralph Caraveo (@deckarep)

- SendGrid

Ian Lance Taylor

unread,

Jan 28, 2015, 2:46:33 PM1/28/15

to ralph....@sendgrid.com, golang-nuts

On Wed, Jan 28, 2015 at 10:18 AM, <ralph....@sendgrid.com> wrote:
>
> We recently started seeing a crash in production related to our long running
> Go daemon crashing with: pthread_create failed: Resource temporarily
> unavailable.

I believe this can happen on GNU/Linux if your program uses cgo and if
thread A is in the Go runtime starting up a new thread B while thread
C is execing a program. The underlying cause is that while one thread
is calling exec the Linux kernel will fail attempts by other threads
to call clone by returning EAGAIN. (Look for uses of the in_exec
field in the kernel sources.)

You said your program uses cgo, so if it also calls exec (that is,
uses the os/exec package or calls os.StartProcess), then this may be
the problem.

A simple fix might be to edit runtime/cgo/gcc_linux_amd64.c to use
some sort of loop when pthread_create returns EAGAIN. If that seems
to help, please open a bug report and we can address this for 1.5 one
way or another.

Ian

ralph....@sendgrid.com

unread,

Jan 28, 2015, 3:09:56 PM1/28/15

to golan...@googlegroups.com, ralph....@sendgrid.com

Thanks Ian,

So we're not actually using os.exec or spawning any process using os.StartProcess either. We are basically just importing a library that is using cgo under the hood to write to the SYSLOG facility. A possibility here is that perhaps writing to syslog is backing up causing the additional threads to be created. (Just a theory so far...need to investigate this path).

Thanks,

-Ralph Caraveo

Tamás Gulácsi

unread,

Jan 29, 2015, 1:05:02 AM1/29/15

to golan...@googlegroups.com

A workaround may be having a separate process logging to syslog with cgo, and your main program sends logs through Unix pipes or whatever.

Not nice, but may help make the main program cgo-free.

Dave Cheney

unread,

Jan 29, 2015, 3:10:42 AM1/29/15

to golan...@googlegroups.com

That's a sound theory, there will be one is thread consumed per cgo operation in flight.

ralph....@sendgrid.com

unread,

Jan 29, 2015, 1:26:53 PM1/29/15

to golan...@googlegroups.com

Yes,

So far it's looking more and more like that is the culprit. We will have to rework our strategy for logging to syslog a bit and see how this goes.

I'll report back my findings should someone else find this useful.

-Ralph Caraveo

Lars Tørnes Hansen

unread,

Jan 29, 2015, 6:52:03 PM1/29/15

to golan...@googlegroups.com

The Varnish cache is using shared memory for logging purposes:

https://www.varnish-cache.org/docs/trunk/reference/vsm.html

Maybe it is useful to look at the (BSD licensed) source code.

/Lars

Kevin Gillette

unread,

Jan 31, 2015, 4:31:06 AM1/31/15

to golan...@googlegroups.com, ralph....@sendgrid.com

On Wednesday, January 28, 2015 at 12:46:33 PM UTC-7, Ian Lance Taylor wrote:

On Wed, Jan 28, 2015 at 10:18 AM, <ralph....@sendgrid.com> wrote:
>
> We recently started seeing a crash in production related to our long running
> Go daemon crashing with: pthread_create failed: Resource temporarily
> unavailable.

I believe this can happen on GNU/Linux if your program uses cgo and if
thread A is in the Go runtime starting up a new thread B while thread
C is execing a program. The underlying cause is that while one thread
is calling exec the Linux kernel will fail attempts by other threads
to call clone by returning EAGAIN. (Look for uses of the in_exec
field in the kernel sources.)

I've run into this issue while using os/exec concurrently with a net/http Client making requests that required DNS lookups. The panic was reasonably easy and quick to reproduce. I was not specifically using cgo other than implicitly via the net package; I rebuilt the stdlib with the netgo tag, and the issue has not reoccurred.

Sokolov Yura

unread,

Feb 1, 2015, 1:50:33 AM2/1/15

to golan...@googlegroups.com

In the init script while it runned as root do

ulimit -u 1000000

This will increase allowed number of processes, and linux treats threads as processes.

Possibly there is a need to do a second call to ulimit after user is changed

Ralph Caraveo

unread,

Feb 7, 2015, 6:49:36 PM2/7/15

to Sokolov Yura, golan...@googlegroups.com

Hi Group,

I've been meaning to follow up on this thread. It turned out to be that our Syslog library was the problem after all. We were able to correlate that the excessive threads were introduced when we toggled to logging to the Syslog facility in our production environment. Once we did this, excessive threads from the Cgo calls logging to Syslog were created.

We ended up toggling this back off in production and saw native OS threads go way, way down. In fact our daemon is running nicely with only a handful of threads ever alive.

The threadcreate debugging profile helped us track this down and instead of writing to Syslog we are now writing to STDERR and piping the output to the Logger shell command. Smooth sailing at the point.

Im not exactly sure how to classify this issue but we now will be very cautious of introducing Cgo calls going forward. Live and learn. =)

Thanks for everybody's suggestions and help. This is why the Go community kicks ass!

-Ralph Caraveo
@deckarep

SendGrid

> --
> You received this message because you are subscribed to a topic in the Google Groups "golang-nuts" group.
> To unsubscribe from this topic, visit https://groups.google.com/d/topic/golang-nuts/8gszDBRZh_4/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to golang-nuts...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Dave Cheney

unread,

Feb 8, 2015, 6:39:53 PM2/8/15

to golan...@googlegroups.com

Excellent news. Thanks for following up.

I agree with your recommendation, excessive cgo usage breaks Go's promise of lightweight concurrency.

Reply all

Reply to author

Forward