lots of goroutines waiting indefinitely to read from connection

kartik

Sep 2, 2015, 8:29:17 AM
to golang-nuts
Hi,
go version go1.4.1 linux/amd64

I have implemented an HTTP server which acts as a reverse proxy with some additional functionality. It has been running in production for more than 2 months now, and I've observed that its memory usage keeps increasing.

When I used (browser-based) profiling and inspected the goroutine stack dump, I found that more than 20K goroutines are waiting for I/O with the same call trace:

goroutine 763144402 [IO wait, 46971 minutes]:
net.(*pollDesc).Wait(0xc672766f40, 0x72, 0x0, 0x0)
	/usr/local/go/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc672766f40, 0x0, 0x0)
	/usr/local/go/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).Read(0xc672766ee0, 0xc651362000, 0x1000, 0x1000, 0x0, 0x7fe5bc41ad80, 0xc65133d5d0)
	/usr/local/go/src/net/fd_unix.go:242 +0x40f
net.(*conn).Read(0xc476b477f0, 0xc651362000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:121 +0xdc
net/http.(*liveSwitchReader).Read(0xc672725a88, 0xc651362000, 0x1000, 0x1000, 0x2, 0x0, 0x0)
	/usr/local/go/src/net/http/server.go:214 +0xab
io.(*LimitedReader).Read(0xc4191cbea0, 0xc651362000, 0x1000, 0x1000, 0x2, 0x0, 0x0)
	/usr/local/go/src/io/io.go:408 +0xce
bufio.(*Reader).fill(0xc540048900)
	/usr/local/go/src/bufio/bufio.go:97 +0x1ce
bufio.(*Reader).ReadSlice(0xc540048900, 0xc54008eb0a, 0x0, 0x0, 0x0, 0x0, 0x0)
	/usr/local/go/src/bufio/bufio.go:295 +0x257
bufio.(*Reader).ReadLine(0xc540048900, 0x0, 0x0, 0x0, 0xc1b6c61900, 0x0, 0x0)
	/usr/local/go/src/bufio/bufio.go:324 +0x62
net/textproto.(*Reader).readLineSlice(0xc71b9e61b0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/textproto/reader.go:55 +0x9e
net/textproto.(*Reader).ReadLine(0xc71b9e61b0, 0x0, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/textproto/reader.go:36 +0x4f
net/http.ReadRequest(0xc540048900, 0xc42eb329c0, 0x0, 0x0)
	/usr/local/go/src/net/http/request.go:598 +0xcb
net/http.(*conn).readRequest(0xc672725a40, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/http/server.go:586 +0x26f
net/http.(*conn).serve(0xc672725a40)
	/usr/local/go/src/net/http/server.go:1162 +0x69e
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:1751 +0x35e

Calls to my server come from other servers (not browsers). Also, my server never closes the connection (for performance reasons), but clients do (whenever the response time exceeds, say, 200ms).

I haven't configured a ReadTimeout (or WriteTimeout) on my server, and I wonder why these goroutines have been waiting for days.

Has anyone else faced a similar problem? Does this happen when the client closes the connection while the goroutine is still reading? The memory on my server has been increasing steadily as the number of goroutines in this wait state keeps growing.
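
For reference, a minimal sketch of exposing that goroutine dump via net/http/pprof (an assumed setup; my actual wiring may differ):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on DefaultServeMux
)

func main() {
	// The full goroutine stack dump is then served at
	// http://localhost:6060/debug/pprof/goroutine?debug=2
	log.Println(http.ListenAndServe("localhost:6060", nil))
}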

Thanks for your patience
Regards
kartik

jo...@claireit.nl

Sep 2, 2015, 8:37:51 AM
to golang-nuts
Yes, a goroutine that blocks forever is the kind of leak you are experiencing now. It's best to give the goroutine a way to check whether it is still needed and to return when it isn't, so its resources can be reclaimed.
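
A generic sketch of that pattern, with a hypothetical worker goroutine (not the net/http internals shown in the stack trace above): give the goroutine a done channel and a bounded read so it can notice it is no longer needed and return.

package example

import (
	"net"
	"time"
)

// readLoop reads from conn until done is closed, then returns so the
// goroutine and the connection it holds are released.
func readLoop(conn net.Conn, done <-chan struct{}) {
	buf := make([]byte, 4096)
	for {
		select {
		case <-done:
			conn.Close()
			return
		default:
		}
		// Bound each Read so the done channel is re-checked periodically
		// instead of blocking forever.
		conn.SetReadDeadline(time.Now().Add(time.Second))
		if _, err := conn.Read(buf); err != nil {
			if ne, ok := err.(net.Error); ok && ne.Timeout() {
				continue // deadline expired; loop and check done again
			}
			return // real error or EOF: let the goroutine exit
		}
	}
}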

On Wednesday, September 2, 2015 at 2:29:17 PM UTC+2, kartik wrote:

James Bardin

Sep 2, 2015, 10:04:23 AM
to golang-nuts


On Wednesday, September 2, 2015 at 8:29:17 AM UTC-4, kartik wrote:

Calls to my server come from other servers (not browsers). Also, my server never closes the connection (for performance reasons), but clients do (whenever the response time exceeds, say, 200ms).

I haven't configured a ReadTimeout (or WriteTimeout) on my server, and I wonder why these goroutines have been waiting for days.

Has anyone else faced a similar problem? Does this happen when the client closes the connection while the goroutine is still reading? The memory on my server has been increasing steadily as the number of goroutines in this wait state keeps growing.

There's not much we can do without any code to see, but most likely it's not the server, and the http client is keeping connections open. Either you're failing to close every response body and leaking connections, or it's possible that you're requesting from thousands of hosts and maintaining a lot of keepalive connections.

You should also still have timeouts on everything if you have a long running service. You can't guarantee the behavior of every client and server you're connected to. 
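
For example, a minimal sketch of server-side timeouts (the values are illustrative; in Go 1.4 the server arms the read deadline before reading each request, so ReadTimeout also bounds how long an idle keep-alive connection waits for its next request):

package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr:    ":8080",
		Handler: http.DefaultServeMux, // or your proxy handler
		// ReadTimeout is applied when each request starts being read, so it
		// also closes keep-alive connections that sit idle for too long.
		ReadTimeout:  30 * time.Second,
		WriteTimeout: 30 * time.Second,
	}
	log.Fatal(srv.ListenAndServe())
}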

kartik

Sep 2, 2015, 10:05:08 AM
to golang-nuts, jo...@claireit.nl
Thanks for the reply, but I don't understand your point. These goroutines are spawned by the net/http package while serving requests; I am using httputil.ReverseProxy.

Tamás Gulácsi

Sep 2, 2015, 10:44:02 AM
to golang-nuts
Please set timeouts!

kartik

Sep 3, 2015, 6:33:38 AM
to golang-nuts
Thanks for the reply.
Here is the code:
http://play.golang.org/p/UMuGFFb2li
// This is a dynamic reverse proxy. It receives the backend host in the X-purl HTTP header and forwards the request to it.
// It gives the backend server 200ms to serve the request.
The code has some unnecessary detail in it and is not compilable. Mostly you can start reading from here:
func (dp *DRProxy) ServeHTTP(w http.ResponseWriter, r *http.Request)


There's not much we can do without any code to see, but most likely it's not the server, and the http client is keeping connections open.
How do you conclude that it is the client which is causing the leak?
Either you're failing to close every response body and leaking connections, or it's possible that you're requesting from thousands of hosts and maintaining a lot of keepalive connections.
I am maintaining a lot of keepalive connections. But why does IO wait show so many minutes? Is the IO wait time shown in the goroutine stack dump cumulative, or does it reset once the same connection gets a subsequent request?

You should also still have timeouts on everything if you have a long running service. You can't guarantee the behavior of every client and server you're connected to. 
I'll add timeouts. Thanks for your help, I really appreciate it :)

kartik

Sep 3, 2015, 6:37:21 AM
to golang-nuts
netstat also shows 20K connections on this server. However, what baffles me is why memory is progressively increasing and why the goroutine stack dump shows such high IO wait times.

James Bardin

Sep 3, 2015, 10:17:59 AM
to kartik, golang-nuts
On Thu, Sep 3, 2015 at 6:33 AM, kartik <ghaya...@gmail.com> wrote:
> Thanks for the reply.
> Here is the code
> http://play.golang.org/p/UMuGFFb2li
> // This is a dynamic reverse proxy. It receives the backend host in the
> X-purl HTTP header and forwards the request to it.
> // It gives the backend server 200ms to serve the request.
> The code has some unnecessary detail in it and is not compilable. Mostly
> you can start reading from here:
> func (dp *DRProxy) ServeHTTP(w http.ResponseWriter, r *http.Request)
>
>>
>> There's not much we can do without any code to see, but most likely it's
>> not the server, and the http client is keeping connections open.
>
> How do you conclude that it is the client which is causing the leak?

I don't, but it's the most common scenario, since it's easier to lose
track of the connections on the client side. In this case the "client"
is the ReverseProxy, which will ensure that the response body is
closed, so that shouldn't be a problem.

I don't understand why you have multiple ReverseProxies, each with
their own transport. Each one has to maintain its own idle connection
pool, and will multiply the number of connections that you keep
open.

A lot of your code seems to be around synchronizing multiple proxy
structs. It would be a lot simpler if you just used a single instance,
and it would make the use of idle connections a lot more efficient.
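
A minimal sketch of that single-instance approach (the X-purl header name comes from this thread; the rest, including the values, is illustrative):

package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

func main() {
	// One shared Transport means one idle-connection pool for all backends.
	transport := &http.Transport{
		MaxIdleConnsPerHost:   16,                     // illustrative
		ResponseHeaderTimeout: 200 * time.Millisecond, // backend must answer within 200ms
	}

	// One ReverseProxy whose Director picks the backend per request from
	// the X-purl header, as described in this thread.
	proxy := &httputil.ReverseProxy{
		Director: func(r *http.Request) {
			target, err := url.Parse(r.Header.Get("X-purl"))
			if err != nil || target.Host == "" {
				return // real code would reject the request here
			}
			r.URL.Scheme = target.Scheme
			r.URL.Host = target.Host
		},
		Transport: transport,
	}

	log.Fatal(http.ListenAndServe(":8080", proxy))
}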

>>
>> Either you're failing to close every response body and leaking connections,
>> or it's possible that you're requesting from thousands of hosts and
>> maintaining a lot of keepalive connections.
>
> I am maintaining a lot of keepalive connections. But why does IO wait show
> so many minutes? Is the IO wait time shown in the goroutine stack dump
> cumulative, or does it reset once the same connection gets a subsequent
> request?

All of your idle connections need to be in IO wait in order to receive
data. I'd expect it to be very high when maintaining lots of open
connections.

James Bardin

Sep 3, 2015, 10:21:01 AM
to kartik, golang-nuts
On Thu, Sep 3, 2015 at 6:37 AM, kartik <ghaya...@gmail.com> wrote:
> netstat also shows 20K connections on this server. However, what baffles me
> is why memory is progressively increasing and why the goroutine stack dump
> shows such high IO wait times.

Look at what you're connected to. Clients, servers, both? If clients
aren't releasing connections, then you will have to forcefully close
them. A network connection usually does more waiting than anything
else, so saying that the IO wait time is "high" doesn't help without
something to compare it to.

kartik

Sep 4, 2015, 2:06:56 AM
to golang-nuts, ghaya...@gmail.com
Apologies for not explaining the problem in detail. My application is a Dynamic Reverse Proxy (let's call it DRP). The clients (let's call them CLIENTS) of DRP are servers (not browsers), which send an HTTP request with a header X-purl that contains the backend URL (e.g. X-purl: http://<url>) to which DRP should forward the request. Let's call the backend servers BSERVERS. DRP gives a BSERVER 200 milliseconds to respond to a request that DRP forwards to it.

When I mentioned that netstat showed 20K connections, most of these connections are with the CLIENTS, not with the BSERVERS.

The application's memory keeps increasing (over a couple of weeks). A lot of goroutines (~20K) are in IO wait state with the same call trace (mentioned above). Even when I decrease the traffic to a server, the number of goroutines does not come down.

Is there a real leak of connections or goroutines that are not being cleaned up? Is the IO wait time cumulative?
Are there 20K goroutines because DRP has 20K keep-alive connections with CLIENTS? Please note that the CLIENTS also have a timeout at their end (say 150ms), and if DRP doesn't respond within that time, the CLIENT closes the connection.
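
For illustration, a hypothetical Go CLIENT behaving as described above (the URL is a placeholder and the real CLIENTS may be implemented differently):

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// The CLIENT gives DRP 150ms; when the timeout fires, the request is
	// canceled and the underlying connection is closed rather than being
	// put back into the idle pool.
	client := &http.Client{Timeout: 150 * time.Millisecond}

	resp, err := client.Get("http://drp.example.internal/some/path") // placeholder URL
	if err != nil {
		fmt.Println("request failed or timed out:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}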



I don't understand why you have multiple ReverseProxies, each with
their own transport. Each one has to maintain its own idle connection
pool, and will multiply the number of connections that you keep
open.

A lot of your code seems to be around synchronizing multiple proxy
structs. It would be a lot simpler if you just used a single instance,
and it would make the use of idle connections a lot more efficient.
 
Thanks for your patience in going through my code. The reason I am using multiple reverse proxies (per URL) is that I observed heavy lock contention on the transport, i.e. the lock that is taken for "putIdleConn" and "getIdleConn" (transport.go).


All of your idle connections need to be in IO wait in order to receive
data. I'd expect it to be very high when maintaining lots of open
connections.
So are you saying that IO wait is cumulative? 

kartik

Sep 4, 2015, 2:11:02 AM
to golang-nuts, ghaya...@gmail.com
When I said my application has 20K connections, 99% of them are with the CLIENTS, not the Backend Servers. My application never breaks the connection with the Clients or Backend Servers (i.e. Keep-Alive is on), but it does set a timeout (response_header_timeout) when forwarding requests to a Backend Server, and thus breaks the connection with the Backend Server if the response doesn't come within the given time.

kartik

Sep 4, 2015, 2:14:46 AM
to golang-nuts, ghaya...@gmail.com
I can set a ReadTimeout, but that doesn't explain the observed increase in memory. I don't have a problem with maintaining lots of idle connections (they will never exceed 30K, since the clients are limited), but the increase in memory is a cause for concern, as I feel that something wrong is happening that is causing a leak.

Giulio Iotti

Sep 4, 2015, 6:38:37 AM
to golang-nuts, ghaya...@gmail.com
On Friday, September 4, 2015 at 8:14:46 AM UTC+2, kartik wrote:
I can set a ReadTimeout, but that doesn't explain the observed increase in memory. I don't have a problem with maintaining lots of idle connections (they will never exceed 30K, since the clients are limited), but the increase in memory is a cause for concern, as I feel that something wrong is happening that is causing a leak.

I hardly looked at the code, so excuse me if this is way off. But I noticed you append to pl.q and there is never a deletion from this slice.

Could that be one thing that uses memory?
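
A generic illustration of that kind of growth (pool and entry here are hypothetical stand-ins, not the actual types behind pl.q): if items are only appended, everything stays reachable; popping and clearing the slot lets the garbage collector reclaim it.

package example

// pool is a hypothetical stand-in for the structure holding pl.q.
type pool struct {
	q []*entry
}

// entry is a hypothetical per-item payload.
type entry struct {
	buf [4096]byte
}

// put appends: if nothing ever removes items, every entry (and its buffer)
// stays reachable and the slice's backing array only grows.
func (p *pool) put(e *entry) {
	p.q = append(p.q, e)
}

// get pops the last item and clears the slot so the entry can be collected
// once the caller is done with it.
func (p *pool) get() *entry {
	if len(p.q) == 0 {
		return &entry{}
	}
	e := p.q[len(p.q)-1]
	p.q[len(p.q)-1] = nil
	p.q = p.q[:len(p.q)-1]
	return e
}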

-- 
Giulio Iotti 

James Bardin

Sep 4, 2015, 9:48:58 AM
to kartik, golang-nuts
On Fri, Sep 4, 2015 at 2:06 AM, kartik <ghaya...@gmail.com> wrote:
> Apologies for not explaining the problem in detail. My application is a
> Dynamic Reverse Proxy (let's call it DRP). The clients (let's call them
> CLIENTS) of DRP are servers (not browsers), which send an HTTP request with
> a header X-purl that contains the backend URL (e.g. X-purl: http://<url>)
> to which DRP should forward the request. Let's call the backend servers
> BSERVERS. DRP gives a BSERVER 200 milliseconds to respond to a request that
> DRP forwards to it.
>
> When I mentioned that netstat showed 20K connections, most of these
> connections are with the CLIENTS, not with the BSERVERS.
>
> The application's memory keeps increasing (over a couple of weeks). A lot of
> goroutines (~20K) are in IO wait state with the same call trace (mentioned
> above). Even when I decrease the traffic to a server, the number of
> goroutines does not come down.
>
> Is there a real leak of connections or goroutines that are not being
> cleaned up? Is the IO wait time cumulative?
> Are there 20K goroutines because DRP has 20K keep-alive connections with
> CLIENTS? Please note that the CLIENTS also have a timeout at their end (say
> 150ms), and if DRP doesn't respond within that time, the CLIENT closes the
> connection.

If you have 20k idle connections with clients, you're going to have
20k goroutines in IO wait.

It sounds like the clients only close their connections if the request
takes too long, otherwise they maintain a keepalive connection to the
server. If there are too many client connections for your server, you
need to prune them yourself, or somehow have the clients use fewer
connections.
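
One blunt way to prune them on the server side (a sketch, and it gives up connection reuse): combine the ReadTimeout shown earlier with disabling HTTP keep-alives, so each client connection is closed after its response.

package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr:        ":8080",
		Handler:     http.DefaultServeMux, // or the proxy handler
		ReadTimeout: 30 * time.Second,     // bounds idle waits for the next request
	}
	// Close every client connection once its response is written, so idle
	// connections (and the goroutines reading from them) cannot pile up.
	srv.SetKeepAlivesEnabled(false)
	log.Fatal(srv.ListenAndServe())
}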


>>
>> I don't understand why you have multiple ReverseProxies, each with
>> their own transport. Each one has to maintain its own idle connection
>> pool, and will multiply the number of connections that you keep
>> open.
>>
>> A lot of your code seems to be around synchronizing multiple proxy
>> structs. It would be a lot simpler if you just used a single instance,
>> and it would make the use of idle connections a lot more efficient.
>
>
> Thanks for your patience in going through my code. The reason I am using
> multiple reverse proxies (per URL) is that I observed heavy lock contention
> on the transport, i.e. the lock that is taken for "putIdleConn" and
> "getIdleConn" (transport.go).

This is a known issue when using http.Transport under high load.
Multiple transports can help, but you don't need to have all the
orchestration of loaning them out and returning them to the pool for
each request, or having to manage pools of channels (which is probably
more overhead than allocating channels when you need them). The proxy
and Transports are safe to use concurrently; just pick one for each
request.
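
A sketch of that simpler scheme (the pool size and the round-robin pick are illustrative): build a small fixed set of proxies once and select one per request, with no check-out/check-in bookkeeping.

package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
	"time"
)

func main() {
	// A few ReverseProxies, each with its own Transport, created once at
	// startup to spread the idle-pool lock contention.
	const n = 8 // illustrative size
	proxies := make([]*httputil.ReverseProxy, n)
	for i := range proxies {
		proxies[i] = &httputil.ReverseProxy{
			Director: func(r *http.Request) {
				// X-purl carries the backend URL, as described in this thread.
				if target, err := url.Parse(r.Header.Get("X-purl")); err == nil && target.Host != "" {
					r.URL.Scheme = target.Scheme
					r.URL.Host = target.Host
				}
			},
			Transport: &http.Transport{
				ResponseHeaderTimeout: 200 * time.Millisecond,
			},
		}
	}

	// Round-robin: just pick one proxy per request, no loaning or returning.
	var counter uint64
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		i := atomic.AddUint64(&counter, 1) % n
		proxies[i].ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8080", handler))
}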

>> All of your idle connections need to be in IO wait in order to receive
>> data. I'd expect it to be very high when maintaining lots of open
>> connections.
>
> So are you saying that IO wait is cumulative?
>

I'm not sure what you mean since by definition it has to be a sum of
the time spent in IO wait, but it doesn't matter if you don't have
anything to compare to.