FD leak in Go proxy server which leads to: http: Accept error: accept tcp [::]:8080: too many open files; retrying in 10ms


vendra...@gmail.com

Jul 17, 2015, 10:57:27 AM
to golan...@googlegroups.com
Hi All,

This is my first post. So please excuse if I break any forum rules :-)

Background:
We are running a proxy server written in Go which takes a request from a client, fetches the content the client requested, and sends a response back. We run the server on port 8080.
Go version 1.2.1. The code runs on CentOS 6. TCP parameters are tuned for faster reuse of connections.
The process's max open file limit is set to 200000.

Issue:
Once every 3-5 days, the server gets into an unresponsive state, logging:
Accept error: accept tcp [::]:8080: too many open files; retrying in 1s
Accept error: accept tcp [::]:8080: too many open files; retrying in 1s

A manual restart is required every time we see this.

Suspicious findings:
-> When we observed the lsof output in our internal testing, some connections stay around forever, and their number grows steadily. So we suspect there is an FD leak, and that this number goes much higher in the real-world case.
Ex:
agent 13771 agent    7u  IPv6 75640925      0t0      TCP 128.199.211.246:webcache->123.63.202.169:57585 (ESTABLISHED)
agent 13771 agent    9u  IPv4 75636003      0t0      TCP 10.130.203.8:37627->10.130.203.8:6379 (ESTABLISHED)
agent 13771 agent   10u  IPv6 75640287      0t0      TCP 128.199.211.246:webcache->222.186.129.5:fotogcad (ESTABLISHED)
agent 13771 agent   11u  IPv6 75912851      0t0      TCP 128.199.211.246:webcache->222.186.129.5:ieee-mih (ESTABLISHED)
agent 13771 agent   12u  IPv6 76102384      0t0      TCP 128.199.211.246:webcache->123.63.202.169:56670 (ESTABLISHED)
agent 13771 agent   13u  IPv6 76080513      0t0      TCP 128.199.211.246:webcache->123.63.202.169:49662 (ESTABLISHED)
agent 13771 agent   14u  IPv4 75645718      0t0      TCP 10.130.203.8:37962->10.130.203.8:6379 (ESTABLISHED)
agent 13771 agent   15u  IPv6 76080655      0t0      TCP 128.199.211.246:webcache->123.63.202.169:49666 (ESTABLISHED)
agent 13771 agent   18u  IPv6 76080514      0t0      TCP 128.199.211.246:webcache->123.63.202.169:49663 (ESTABLISHED)

-> In the production lsof output, we found hundreds of lines like these:
agent 15976 agent 1563u  sock                0,6      0t0 39770995 can't identify protocol

-> In the server handler, we set the connection to close:
      w.Header().Set("Connection", "close")

   This is how we fetch the content:
      page, err := GetPage(origURL)
      defer page.Body.Close()

   This is how we listen:
        l, e := net.Listen(proto, srv.Addr)

These are the only things controlled in application code; no timeouts are specified. I have read in other similar posts that we need to specify read/write timeouts.

Please let me know how to identify the problem, and any pointers to solve it. We are not sure how to reproduce the issue. Thanks in advance.

James Bardin

Jul 17, 2015, 11:26:29 AM
to golan...@googlegroups.com


On Friday, July 17, 2015 at 10:57:27 AM UTC-4, Adithya Vendra wrote:


These are the only things controlled in application code. No timeouts specified. Read from other similar posts that we need to specify timeouts for read/write. 


You seem to have answered your own question. When you have a long-running server, you need timeouts on everything. The http.Server, http.Client, and http.Transport (including the Transport.Dial function) all have applicable settings.

You also need to update your Go version. Besides numerous other changes, some related bugs have been fixed in net/http.

Once you have an updated version of Go and reasonable timeouts, if you're still losing track of connections then we will need more specific code to reproduce it. (FYI, the `can't identify protocol` output usually comes from not closing a connection in your code, leaving an open FD whose socket has already been cleaned up, so there's no way to identify the protocol.)

Adithya Vendra

Jul 20, 2015, 2:12:20 AM
to golan...@googlegroups.com


On Friday, July 17, 2015 at 8:56:29 PM UTC+5:30, James Bardin wrote:



You seem to have answered your own question. When you have a long-running server, you need timeouts on everything. The http.Server, http.Client, and http.Transport (including the Transport.Dial function) all have applicable settings.


Thanks for the answer :-) I will put timeouts on all the aspects you mentioned and check again. If you have any ideas on reproducing the issue, that would be helpful :-) Thanks again.

Roger Pack

Jul 20, 2015, 1:26:10 PM
to golan...@googlegroups.com
I assume that to reproduce the problem you'd want to connect a lot of clients and have them "wait forever". In this case, what's probably happening is that the client's connection is aborted, but until you send something to verify the connection is still good (like a timeout ping) you won't detect it, so the number of connections grows forever.

 