Having many HTTP connections is causing excessive thread creation

jas...@icloud.com

Jan 15, 2015, 10:02:49 PM
to golan...@googlegroups.com
I've been hitting a problem trying to create a Go program which makes many concurrent HTTP requests (about 20k). My thought was to have a goroutine per request and let the Go runtime multiplex these across a small thread pool (roughly the size of GOMAXPROCS). However, my program keeps crashing after running for a few minutes. Basically it fails with: runtime: program exceeds 10000-thread limit

It seems there are too many syscalls, so I did some analysis similar to what's in this thread: https://groups.google.com/forum/#!topic/golang-dev/qhgpxRS1Thg. With SetMaxThreads(500), I found:

→ grep "^goroutine " threads1-500.txt | perl -npe 's/\d+/n/' | sort | uniq -c
 412 goroutine n [IO wait, 1 minutes]:
 162 goroutine n [IO wait, 2 minutes]:
 111 goroutine n [IO wait, 3 minutes]:
  24 goroutine n [IO wait, 4 minutes]:
6570 goroutine n [IO wait]:
 133 goroutine n [chan receive, 1 minutes]:
  54 goroutine n [chan receive, 2 minutes]:
  52 goroutine n [chan receive, 3 minutes]:
  12 goroutine n [chan receive, 4 minutes]:
4894 goroutine n [chan receive]:
3906 goroutine n [chan send]:
 484 goroutine n [runnable]:
1053 goroutine n [select, 1 minutes]:
 297 goroutine n [select, 2 minutes]:
 156 goroutine n [select, 3 minutes]:
  25 goroutine n [select, 4 minutes]:
17562 goroutine n [select]:
   2 goroutine n [semacquire]:
 422 goroutine n [sleep]:
   1 goroutine n [syscall, 6 minutes, locked to thread]:
   1 goroutine n [syscall, 6 minutes]:
  12 goroutine n [syscall]:
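
For reference, the lowered thread ceiling comes from runtime/debug.SetMaxThreads; a minimal sketch (500 is just the value used for this experiment, the default limit is 10000):

package main

import "runtime/debug"

func main() {
    // Lower the thread ceiling so the "thread limit" crash (and its
    // accompanying stack dump) fires sooner while investigating; the
    // default limit is 10000.
    debug.SetMaxThreads(500)

    // ... rest of the program ...
}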

There were only 14 goroutines in [syscall] state, so I looked at the ones in [runnable] with Syscall in the call stack. They were basically all network operations related to my HTTP requests, either syscall.connect, syscall.write, or syscall.read. Here's one example:

goroutine 700108 [runnable]:
syscall.Syscall(0x62, 0x3829, 0xc36758ac4c, 0x10, 0xffffffffffffffff, 0x0, 0x24)
/usr/local/go/src/syscall/asm_darwin_amd64.s:20 +0x5
syscall.connect(0x3829, 0xc36758ac4c, 0xc300000010, 0x0, 0x0)
/usr/local/go/src/syscall/zsyscall_darwin_amd64.go:64 +0x56
syscall.Connect(0x3829, 0x4b93ac8, 0xc36758ac40, 0x0, 0x0)
/usr/local/go/src/syscall/syscall_unix.go:198 +0x7f
net.(*netFD).connect(0xc3675cde30, 0x0, 0x0, 0x4b93ac8, 0xc36758ac40, 0x0, 0x0, 0x0, 0x0, 0x0)
/usr/local/go/src/net/fd_unix.go:75 +0x6c
net.(*netFD).dial(0xc3675cde30, 0x4b96d58, 0x0, 0x4b96d58, 0xc31de53e00, 0x0, 0x0, 0x0, 0x0, 0x0)
/usr/local/go/src/net/sock_posix.go:139 +0x37a
net.socket(0x450f340, 0x3, 0x2, 0x1, 0x0, 0xc31de53e00, 0x4b96d58, 0x0, 0x4b96d58, 0xc31de53e00, ...)
/usr/local/go/src/net/sock_posix.go:91 +0x422
net.internetSocket(0x450f340, 0x3, 0x4b96d58, 0x0, 0x4b96d58, 0xc31de53e00, 0x0, 0x0, 0x0, 0x1, ...)
/usr/local/go/src/net/ipsock_posix.go:137 +0x148
net.dialTCP(0x450f340, 0x3, 0x0, 0xc31de53e00, 0x0, 0x0, 0x0, 0x200000003, 0x0, 0x0)
/usr/local/go/src/net/tcpsock_posix.go:156 +0x125
net.DialTCP(0x450f340, 0x3, 0x0, 0xc31de53e00, 0x4015000, 0x0, 0x0)
/usr/local/go/src/net/tcpsock_posix.go:152 +0x25c
fetch.dialSingle(0x450f340, 0x3, 0xc36758abe0, 0x15, 0x0, 0x0, 0x4b96cc8, 0xc31de53e00, 0xecc4a675a, 0xc20cdfbbee, ...)
.../src/fetch/dial.go:41 +0x200
fetch.func·001(0xecc4a675a, 0xc20cdfbbee, 0x47a0160, 0x0, 0x0, 0x0, 0x0)
.../src/fetch/dial.go:17 +0xbd
fetch.dial(0x450f340, 0x3, 0x4b96cc8, 0xc31de53e00, 0xc31c79db60, 0xecc4a675a, 0xcdfbbee, 0x47a0160, 0x0, 0x0, ...)
.../src/fetch/dial.go:31 +0x6f
fetch.func·002(0x450f340, 0x3, 0xc36758abe0, 0x15, 0x0, 0x0, 0x0, 0x0)
.../src/fetch/dial.go:19 +0x38d
net/http.(*Transport).dial(0xc2c29b8630, 0x450f340, 0x3, 0xc36758abe0, 0x15, 0x0, 0x0, 0x0, 0x0)
/usr/local/go/src/net/http/transport.go:479 +0x84
net/http.(*Transport).dialConn(0xc2c29b8630, 0x0, 0xc285048430, 0x4, 0xc36758abe0, 0x15, 0xc21c2faf00, 0x0, 0x0)
/usr/local/go/src/net/http/transport.go:564 +0x1678
net/http.func·019()
/usr/local/go/src/net/http/transport.go:520 +0x42
created by net/http.(*Transport).getConn
/usr/local/go/src/net/http/transport.go:522 +0x335

So what I'm wondering is: how do you make a large number of concurrent HTTP requests without ending up with a thread per request? Is this not a common enough scenario with Go? Do I need to limit the concurrency of dialConn myself, using a custom transport? Won't that hurt overall throughput though, and are there plans to address this issue in a future release of Go?

Thanks,
Jason

Dave Cheney

Jan 15, 2015, 11:10:59 PM
to golan...@googlegroups.com
Can you upload the entire crash message? Does your program do a lot of file (not network) io in response to network requests? Each of those goroutines will consume a native thread under the current runtime.

minux

Jan 15, 2015, 11:24:18 PM
to jas...@icloud.com, golang-nuts
On Thu, Jan 15, 2015 at 10:02 PM, <jas...@icloud.com> wrote:
I've been hitting a problem trying to create a Go program which makes many concurrent HTTP requests (about 20k). My thought was having a goroutine per request, and the Go runtime would multiplex these across a small thread pool (roughly size of GOMAXPROCS). However, my program keeps crashing after running for a few minutes. Basically it fails with: runtime: program exceeds 10000-thread limit
Are these requests to the same server or to different servers?
Highly concurrent DNS requests will also consume a lot of OS threads
if you're using the cgo-based net package.

You can try building your program with the pure Go net package.

go build -installsuffix netgo -tags netgo -a -v yourprogram.go
(This is the latest workaround for go build -a for Go 1.4)

And see if that helps.

jas...@icloud.com

Jan 15, 2015, 11:30:53 PM
to golan...@googlegroups.com, jas...@icloud.com
Mostly different servers. I wrote my own DNS resolver in Go, using miekg/dns, so no cgo calls there (and it's rate limited). I did try using my own dialer, which puts a rate limit on net.Dial(), and that did fix the thread creation problem. With a cap of 512 concurrent net.Dial() calls, though, throughput drops by about half. Is there a better workaround? I'm already re-using connections as much as possible.
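
A minimal sketch of that kind of rate-limited dialer (the 512 cap, the 30-second timeout and the names are illustrative, not tuned values; Transport.Dial is the Go 1.4-era hook):

package fetch

import (
    "net"
    "net/http"
    "time"
)

// dialSem is a counting semaphore: at most 512 net.Dial calls in flight.
var dialSem = make(chan struct{}, 512)

func limitedDial(network, addr string) (net.Conn, error) {
    dialSem <- struct{}{}        // acquire a slot (blocks once 512 dials are in flight)
    defer func() { <-dialSem }() // release the slot when the dial returns
    return net.DialTimeout(network, addr, 30*time.Second)
}

// Client routes every dial through the semaphore above.
var Client = &http.Client{
    Transport: &http.Transport{Dial: limitedDial},
}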

One reason I suspect the throughput is halved is that once I hit 512 slow connects (keep in mind I have about 20k goroutines, each with a different host), the program stalls (some of these connects seem to take minutes). I tried setting the Timeout on the http.Client, but that led me to a different problem (a panic because the request canceler tries to close the same channel more than once), but that's maybe worthy of another thread.

Thanks,
Jason

PS, Dave: Working on uploading the full stack dump as well.

jas...@icloud.com

Jan 16, 2015, 12:11:16 AM
to golan...@googlegroups.com, jas...@icloud.com
Attaching the full stack dump. This one was actually with the 512-cap on concurrent net.Dial() calls, so I guess I didn't fully fix the issue, just significantly postponed it.

What's odd about this particular stack dump is that it shows pretty much nothing in [syscall] or [runnable] state; almost all goroutines are blocked, so I'm not sure why it still created 1000 threads:

 412 goroutine n [IO wait, 1 minutes]:
 162 goroutine n [IO wait, 2 minutes]:
 111 goroutine n [IO wait, 3 minutes]:
  24 goroutine n [IO wait, 4 minutes]:
6570 goroutine n [IO wait]:
 133 goroutine n [chan receive, 1 minutes]:
  54 goroutine n [chan receive, 2 minutes]:
  52 goroutine n [chan receive, 3 minutes]:
  12 goroutine n [chan receive, 4 minutes]:
4894 goroutine n [chan receive]:
3906 goroutine n [chan send]:
 484 goroutine n [runnable]:
1053 goroutine n [select, 1 minutes]:
 297 goroutine n [select, 2 minutes]:
 156 goroutine n [select, 3 minutes]:
  25 goroutine n [select, 4 minutes]:
17562 goroutine n [select]:
   2 goroutine n [semacquire]:
 422 goroutine n [sleep]:
   1 goroutine n [syscall, 6 minutes, locked to thread]:
   1 goroutine n [syscall, 6 minutes]:
  12 goroutine n [syscall]:

Dave: Regarding your other question, I am doing some disk activity (via RocksDB) and some other network activity (via Kafka). I limit concurrent RocksDB calls to GOMAXPROCS (24 in my case), since those are cgo calls. The Kafka work is just a handful of goroutines; they feed off a channel that all other goroutines insert into.

Any ideas how to tell from the stack dump what's causing the thread creation? Having limited the net.Dial() calls to 512, I'm very confused now why it's still creating up to 1000 threads.
threads3-1000_clean.txt

minux

Jan 16, 2015, 12:37:14 AM
to jas...@icloud.com, golang-nuts
Generally it's impossible to tell from the goroutine stack dump what caused
excessive OS thread creation. The reason is simple: unless you see a lot of
[locked to thread] markings, the goroutine that caused an OS thread to be created
might have already finished, or be executing otherwise normal code.

The correct way is to use "threadcreate" profiling (runtime/pprof or net/http/pprof).
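
A minimal sketch of wiring that up with net/http/pprof (the port is arbitrary):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/ handlers, including threadcreate
)

func main() {
    // Expose the profiling endpoints on a side port; inspect thread
    // creation sites with:
    //   go tool pprof http://localhost:6060/debug/pprof/threadcreate
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // ... rest of the program ...
    select {}
}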

Jason Douglas

Jan 16, 2015, 1:01:22 AM
to minux, golang-nuts
Exactly. There are many places where code could be making a system call under the covers, and not much information to track those places down.


The correct way is to use "threadcreate" profiling (runtime/pprof or net/http/pprof).

My understanding was this is currently broken: https://github.com/golang/go/issues/6104



minux

Jan 16, 2015, 1:49:13 AM
to Jason Douglas, golang-nuts
Right. I've totally forgotten about this issue.

One temporary workaround is to use Go 1.1 to diagnose this problem
(assuming your code still compiles as Go 1.1).

Peter Waller

Jan 16, 2015, 1:01:25 PM
to Jason Douglas, golang-nuts
This may or may not be obvious, but are you closing the response body?
If you don't, you may end up with more goroutines than expected.
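
A minimal sketch of the pattern (the helper name is just for illustration; draining the body before closing lets the keep-alive connection be reused):

package fetch

import (
    "io"
    "io/ioutil"
    "net/http"
)

// get makes one request, then drains and closes the body so the
// underlying connection can go back into the Transport's pool instead
// of leaving goroutines and connections behind.
func get(client *http.Client, url string) error {
    resp, err := client.Get(url)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    _, err = io.Copy(ioutil.Discard, resp.Body)
    return err
}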

ja...@eggnet.com

Jan 19, 2015, 8:29:16 PM
to golan...@googlegroups.com
If you want to limit the number of threads created, you could have a separate worker pool for file i/o and farm out file operations to those workers. That is, don't use file read/write/open/close operations in the goroutines that are directly handling clients. Send file i/o operations to a channel with, for example, 10 or 50 goroutines listening for commands. The number of workers you would want handling file i/o operations depends on your disk i/o situation.
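
A minimal sketch of that pattern, assuming a write-only workload (the names, channel layout and ioutil.WriteFile call are illustrative):

package filepool

import "io/ioutil"

// writeOp describes one file write; the result comes back on reply.
type writeOp struct {
    path  string
    data  []byte
    reply chan error
}

var writes = make(chan writeOp)

// Start launches n worker goroutines that own all file i/o, so at most
// n OS threads can ever be blocked in disk syscalls.
func Start(n int) {
    for i := 0; i < n; i++ {
        go func() {
            for op := range writes {
                op.reply <- ioutil.WriteFile(op.path, op.data, 0644)
            }
        }()
    }
}

// Write is what the client-handling goroutines call instead of touching
// the filesystem directly.
func Write(path string, data []byte) error {
    reply := make(chan error, 1)
    writes <- writeOp{path, data, reply}
    return <-reply
}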

Jason Douglas

Jan 19, 2015, 11:02:59 PM
to ja...@eggnet.com, golan...@googlegroups.com
Yes, if the goroutines were mostly CPU intensive that solution would work; unfortunately they’re about 95% I/O bound (which is why I have so many of them).

I was somewhat able to fix the problem by rate-limiting the number of dial calls using a custom dialer with my http.Client. Whereas before my app could only run for minutes before crashing, now it can run for days. I could probably track down the remaining pieces of code that need rate-limiting; it’s just time consuming.

In general, I hope Go figures out a good way to deal with this type of issue in the future. One of the main draws of Go is how embarrassingly easy it is to write resource-efficient asynchronous networking applications using large numbers of goroutines. But this ease of use seems to break down once you hit a certain limit. Having to manually wrap rate limiting around every system call feels messy, at best. I know this is a hard problem, especially when you don’t know whether a system call will block or not, but maybe it could be fixed with more granular control over when Go decides to spawn new threads.

Thanks,
Jason

On Jan 19, 2015, at 5:29 PM, ja...@eggnet.com wrote:

If you want to limit the number of threads created you could have a separate worker pool for file i/o and farm out file operations to those workers. i.e., don't use file read/write/open/close operations in the goroutines that are directly handling clients. Send file i/o operations to a channel with for example, 10 or 50 or whatever goroutines listening for commands. The number of workers you would want handling file i/o operations depends on your disk i/o situation.

ja...@eggnet.com

Jan 20, 2015, 1:24:43 AM
to golan...@googlegroups.com
Be careful of lumping all i/o problems together.

As a general rule, storage i/o is very different from network i/o. A given computer can handle an extremely large number of concurrent network sessions efficiently. You also do not want to stall your program because of some network conditions outside of your control, like latency, packet loss or congestion.

Storage i/o is local to the machine and, in general, there is very little benefit to massive concurrency when reading and writing to it. Unlike network i/o, when your storage subsystem is maxed out, you simply have to start queueing storage operations.

When you have a large number of OS threads reading and writing to storage, the operating system ends up serializing them anyway. You might as well do it in your application more efficiently.

The way I described handling this is more or less how you would handle writing an nginx clone in Go, for the simple case of reading static files for example. You wouldn't just let Go spawn as many OS threads as it wants, which is what can happen when mixing in disk reads with client connections (assuming you have a large library or are mixing in a heavy write workload, etc.). At that point you have the Apache model of a thread per client, and you might as well actually use Apache or something like it, spawn a thread per connection, and just use blocking i/o.

Is there something Go could do about this? Perhaps. Maybe Go could have a maximum number of dedicated threads for storage i/o (as opposed to network i/o). But storage threads would need to map to the different underlying physical drives for maximum performance (if it isn't all one storage system), and perhaps unevenly depending on various factors.

Unfortunately the bottom line is that Unix-like operating systems do not provide great evented i/o for storage, forcing you to use blocking calls. When that changes, I suspect Go will adapt.

Jason Douglas

Jan 20, 2015, 2:56:05 AM
to ja...@eggnet.com, golan...@googlegroups.com
Actually, I was talking about network I/O, not storage I/O :) I’ve already put rate limiting around my storage I/O calls (and connect() calls now too), for the reasons you mention below. What I’m attempting to implement is a lot like Nginx, actually, but I’m fetching data instead of storing it. It feels like having a few thousand goroutines doing connect(), read(), and write() socket system calls results in a fairly small thread pool, but when you get to tens of thousands they become a much bigger concern (especially if any system calls start blocking).

Jason

maxpow...@gmail.com

Jan 20, 2015, 11:37:22 PM
to golan...@googlegroups.com, ja...@eggnet.com
This may or may not be helpful since it doesn't address the symptoms described per se, but have you considered asking yourself why you are even creating so many threads?
To my mind, what you describe is death by a thousand papercuts. I've experienced it before in other projects, and in the end it made us realize we had been taking the wrong approach.

Goroutines are lightweight and that is cool and all, but you mention you are communicating with hundreds of different hosts simultaneously. I can't even grok what it is you might be trying to accomplish, but I have a feeling that this is not a problem with Go, but possibly a problem with some design assumptions.

So let's not worry about your design and look at some possible tasks you might be trying to accomplish, because no task I can think of should spawn that many independent threads of execution within a single process, and my background is in massively scalable architecture.

I'm imagining you are doing one of a couple of possible things, probably because these are things I've been involved with, but they represent completely different approaches to what on the surface seem like similar tasks.

Task 1 is an application that requires one-way communication, i.e. the broadcast model.
Task 2 is an application requiring two-way communication, such as chat; this can be 1:1 or 1:n. I have no idea what this model is called, but I call it the realtime gaming model.
Task 3 is the two-way realtime stream, i.e. the Skype model. Here the communication is 1:1 or 1:n with n being relatively small, and the chief distinguisher from tasks 1 and 2 is a real need for high QoS to every connected client.

Task 1 is easiest to envision as something like a YouTube stream. Your job mostly involves pushing large amounts of data out to various connected clients and only occasionally receiving data in. Inbound data is mostly signalling such as play, pause, fast-forward, rewind, etc. Failure to receive a packet can result in a minor hiccup or slightly degraded service, but on the whole it should be ok.

The immediate thought is to spawn a single thread or goroutine per connected client; the problem is that this quickly saturates the host machine. At first you run into conditions with locks; once you solve that, you may squeeze a few hundred more clients out before you start hitting per-process limits at the OS level. Once you raise those limits, you start encountering hard limits such as saturation of the memory bus, the file i/o bus, etc., but likely you're going to saturate your network connection first.

The best solution I've seen to this class of problem is to have the actual content at the network edges in a CDN and use the server-side code merely to send authorization information for the client's video player. This works because it's static content. If your content isn't static, such as the evening news or a live sports broadcast, then your real answer is to monitor the amount of outgoing bandwidth, spin up a new host each time you reach a certain threshold of dedicated bandwidth on the server, and point the client at it. A good load balancer can do most of this work without breaking a sweat. I've been informed that this solution is actually called a relay model.

Task 2 is most fun to envision as a semi-realtime activity such as a team-based FPS or even an MMORPG.
With this task, you have connected clients who are pushing nearly as much in as they are pulling out.
For example pressing the up arrow moves the player forward for as long as the up arrow is pressed and now you need to send a message to everyone that "Player Leeroy Jenkins pressed up arrow" for the entire time Leeroy is pressing it. 

Failure to get a packet means that Leeroy might never appear to move, or might move in a very jerky fashion (assuming you are sending position updates as well). There are several solutions to this problem that can all be implemented to dramatically reduce response time and increase the quality for the player. The first is to get rid of the whole idea of a thread per client; it's absolutely not needed and a fast way to waste resources. You need a thread per shard, and how you shard will determine what that looks like. Most servers will shard by map/zone. This works well for map-based shoot 'em ups, but absolutely sucks when you start talking about MMORPGs where the world is supposed to be huge and seamless. So how do you shard? Well, when I design game engines I shard by priority of data and then filter by player distance from the data source. Every player is actually present in all threads, but what they are sent is dictated by how far the data travels (in game) and how far they are from the data source.

Example: Leeroy Jenkins is running towards a monster, but both Leeroy and that monster are beyond another player's field of view. In this case, there is no reason to even bother sending the packet. The view-field constraint (aka fog of war) can easily be calculated in a single instruction on a modern CPU. If the player can't see it, they don't need to know about it. Furthermore, movement (including spell effects) is on a high-priority thread, but things like chat are on a low-priority thread, if they are even on the same server. In fact I've used IRC in the past for all chatter that wasn't local, i.e. within the player's current horizon. I've used pipes to communicate events between processes, literally piping changes in state to /tmp/ai or /tmp/movement for things that might need to be cross-domain. For example, Leeroy is running towards a monster; the monster needs to react, but doesn't need to be sitting in the same thread as the players. Instead an AI server receives the event through a write to /tmp/ai, makes a decision that it's time to wipe out Leeroy and all his friends, and begins to move by writing the movement instruction to /tmp/movement. No threads and no locks needed; these are now distinct processes and seem to scale much better than threads would on a multicore machine.

Now if you have an encounter situation such as a raid, it may be best to move those off into their own little server in their own little world, but the same rules apply. Keep your data domains tiny, offload what isn't absolutely necessary for the server to handle itself, especially if there are already existing solutions, and use your server's resources solely to drive a subset of the game or business logic. If possible, use a publish/subscribe model and a good message queue system to handle the actual act of getting the data to the player. This way it takes a high-speed route over the backbone to the edge nearest the client using a protocol that is likely prioritized, and then from there makes the long slow haul over the last mile.

The third task is superficially more difficult: you need to maintain a realtime a/v stream with a minimum of latency and hiccups. Many people use rtsp for this and a thread-per-client model, but my experience is that thread-per-client reduces available resources. If possible, your server should be responsible only for setting up the call by introducing the clients to each other and then handling signalling. The a/v traffic should ideally flow directly between the two peers.

The problem is that a lot of ISPs will actually mess with this sort of traffic, and "my calls won't complete" or "I tried to answer but no one was there" are common complaints.

In this case, you should have a single thread per conversation, and pipe the output of one stream (at this point you are either using tcp or have tried to reinvent it) directly into the input of the other stream.  This task should be handled with no inspection or intervention if at all possible.  You are aiming to minimize the acts of copying data.

From there, you just have a global thread for signalling and again signalling occurs out of band, since really all you are doing is making or breaking connections for the clients.

If at all possible, you should take any steps you can to minimize duplication of the a/v stream data. The approach we took involved a custom network driver that let us tell the server to directly memory-map data between any two streams at the level of the network interface. I believe this may have caused the data to bypass the OS entirely, but I do know we went from 20 conversations on our test boxes to over 300 by using this.
Unfortunately it was a proprietary lib, created by the network card manufacturer on behalf of my employer (client actually, because this was a consulting gig) in exchange for a large order of network cards.

Again, I apologize if I'm overstepping my bounds. I realize that none of this answers your question about Go, and sadly I'm a complete noob at Go and can't offer a golang-specific solution. I do hope that this information is at least somewhat helpful though, having tackled similar problems in the past and seeing, at least in my mind, the death-by-a-thousand-papercuts anti-pattern :)

ja...@eggnet.com

Jan 21, 2015, 1:00:02 AM
to golan...@googlegroups.com
Just FYI, connect() per se will not block in Go. But the DNS lookup required to actually handle the connect might.

At least on Linux, the underlying call, the libc function getaddrinfo, is blocking. Go might limit the number of concurrent DNS lookups, I have no idea; but I suspect that it doesn't, and you'll end up spawning a thread for every concurrent DNS lookup you need.

I suggest using netgo, which is a pure Go resolver that does not block or spawn threads. It's a bit of a pain to get working in Go 1.4 but worth it.

Additionally, if you find yourself connecting to the same addresses frequently, you'll probably end up wanting to create your own DNS cache within your code.
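
A minimal sketch of such a cache (net.LookupHost and the single TTL are illustrative; a real one would also want to cap concurrent lookups and honour per-record TTLs):

package dnscache

import (
    "net"
    "sync"
    "time"
)

// entry holds resolved addresses plus when they were cached.
type entry struct {
    addrs []string
    when  time.Time
}

var (
    mu    sync.Mutex
    cache = make(map[string]entry)
)

// Lookup returns cached addresses for host, refreshing them once the
// cached entry is older than ttl.
func Lookup(host string, ttl time.Duration) ([]string, error) {
    mu.Lock()
    e, ok := cache[host]
    mu.Unlock()
    if ok && time.Since(e.when) < ttl {
        return e.addrs, nil
    }
    addrs, err := net.LookupHost(host)
    if err != nil {
        return nil, err
    }
    mu.Lock()
    cache[host] = entry{addrs, time.Now()}
    mu.Unlock()
    return addrs, nil
}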

In any case, that's probably enough dart throwing for me :)

Christopher Sebastian

Apr 23, 2015, 9:17:01 AM
to golan...@googlegroups.com, ja...@eggnet.com
I just want to thank "maxpow...@gmail.com" for his excellent, thorough, thoughtful reply.  I found it very interesting and educational.