FD leak in Go proxy server which leads to: http: Accept error: accept tcp [::]:8080: too many open files; retrying in 10ms


vendra...@gmail.com

Jul 17, 2015, 10:57:27 AM
to golan...@googlegroups.com
Hi All,

This is my first post. So please excuse if I break any forum rules :-)

Background:
We are running a proxy server written in Go which takes a request from a client, fetches the content the client requested, and sends a response back. We run the server on port 8080.
Go version 1.2.1. The code runs on CentOS 6. TCP parameters are tuned for faster reuse of connections.
The process's max open file limit is set to 200000.

Issue:
Once every 3-5 days, the server gets into an unresponsive state, logging:
Accept error: accept tcp [::]:8080: too many open files; retrying in 1s
Accept error: accept tcp [::]:8080: too many open files; retrying in 1s

A manual restart is required every time we see this.

Suspicious findings:
-> When we observed the lsof output in our internal testing, some connections stay around forever, and their number grows steadily. So we suspect there is an FD leak, and that this number goes much higher in the real-world case.
Ex:
agent 13771 agent    7u  IPv6 75640925      0t0      TCP 128.199.211.246:webcache->123.63.202.169:57585 (ESTABLISHED)
agent 13771 agent    9u  IPv4 75636003      0t0      TCP 10.130.203.8:37627->10.130.203.8:6379 (ESTABLISHED)
agent 13771 agent   10u  IPv6 75640287      0t0      TCP 128.199.211.246:webcache->222.186.129.5:fotogcad (ESTABLISHED)
agent 13771 agent   11u  IPv6 75912851      0t0      TCP 128.199.211.246:webcache->222.186.129.5:ieee-mih (ESTABLISHED)
agent 13771 agent   12u  IPv6 76102384      0t0      TCP 128.199.211.246:webcache->123.63.202.169:56670 (ESTABLISHED)
agent 13771 agent   13u  IPv6 76080513      0t0      TCP 128.199.211.246:webcache->123.63.202.169:49662 (ESTABLISHED)
agent 13771 agent   14u  IPv4 75645718      0t0      TCP 10.130.203.8:37962->10.130.203.8:6379 (ESTABLISHED)
agent 13771 agent   15u  IPv6 76080655      0t0      TCP 128.199.211.246:webcache->123.63.202.169:49666 (ESTABLISHED)
agent 13771 agent   18u  IPv6 76080514      0t0      TCP 128.199.211.246:webcache->123.63.202.169:49663 (ESTABLISHED)

-> In the production lsof output, we found hundreds of lines like these:
agent 15976 agent 1563u  sock                0,6      0t0 39770995 can't identify protocol

-> In the server handler, we set the connection to close:
      w.Header().Set("Connection", "close")

   This is how we fetch the content:
      page, err := GetPage(origURL)
      defer page.Body.Close()

   This is how we listen:
        l, e := net.Listen(proto, srv.Addr)

These are the only things controlled in application code; no timeouts are specified. I have read in other similar posts that we need to specify read/write timeouts.

Please let me know how to identify the problem, and any pointers to solve it. We are not sure how to reproduce the issue. Thanks in advance.

James Bardin

Jul 17, 2015, 11:26:29 AM
to golan...@googlegroups.com


On Friday, July 17, 2015 at 10:57:27 AM UTC-4, Adithya Vendra wrote:


These are the only things controlled in application code. No timeouts specified. Read from other similar posts that we need to specify timeouts for read/write. 


You seem to have answered your own question. When you have a long-running server, you need timeouts on everything. The http.Server, http.Client, and http.Transport (including the Transport.Dial function) all have applicable settings.

You also need to update your Go version. Besides numerous other changes, some related bugs have been fixed in net/http.

Once you have an updated version of Go and reasonable timeouts, if you're still losing track of connections then we will need more specific code to reproduce it. (FYI, the `can't identify protocol` output usually comes from not closing a connection in your code, leaving an open FD whose socket has already been cleaned up, so there's no way to identify the protocol.)

Adithya Vendra

Jul 20, 2015, 2:12:20 AM
to golan...@googlegroups.com


On Friday, July 17, 2015 at 8:56:29 PM UTC+5:30, James Bardin wrote:



You seem to have answered your own question. When you have a long-running server, you need timeouts on everything. The http.Server, http.Client, and http.Transport (including the Transport.Dial function) all have applicable settings.


Thanks for the answer :-) I will put timeouts on all the aspects you mentioned and check again. If you have any ideas on reproducing the issue, that would be helpful :-) Thanks again.

Roger Pack

Jul 20, 2015, 1:26:10 PM
to golan...@googlegroups.com
I assume that to reproduce the problem you'd want to connect a lot of clients and have them "wait forever". In this case, what's probably happening is that the client's connection is aborted, but until you send something to verify the connection is still good (like a timeout ping) you won't detect it, so the number of connections grows forever.

 