OpenTSDB server encounters "java.io.IOException: Too many open files"

David Yu

Feb 18, 2016, 2:59:46 PM
to OpenTSDB
Hi,

Our OpenTSDB instances kept getting the following exceptions and hung:

[AbstractNioSelector.warn] - Failed to accept a connection.

java.io.IOException: Too many open files
        at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) ~[na:1.7.0_80]
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250) ~[na:1.7.0_80]
        at org.jboss.netty.channel.socket.nio.NioServerBoss.process(NioServerBoss.java:100) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) [netty-3.9.4.Final.jar:na]



The OpenTSDB cluster is separate from our HBase cluster and receives very low request traffic (< 400 requests per hour). I'm not sure how the file descriptor limit could be reached with this kind of traffic.


Has anyone seen this issue before?


Thanks,

David

David Yu

Feb 18, 2016, 3:00:22 PM
to OpenTSDB
Also:

$ ulimit -H -n

4096

Jonathan Creasy

Feb 18, 2016, 4:22:07 PM
to David Yu, OpenTSDB
That's a pretty low limit. What descriptors are open? Anything there that seems unusual?

Have you considered setting it to 10k?
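For reference, here is one way to inspect the limits and raise them persistently. This is a sketch for a typical Linux install; the `opentsdb` user name and the `limits.conf` path are assumptions, not something from this thread:

```shell
# Current soft and hard open-file limits for this shell/user
ulimit -Sn
ulimit -Hn

# To raise them persistently for the opentsdb user (illustrative values,
# 10k as suggested above), add lines like these to
# /etc/security/limits.conf and restart the TSD after re-login:
#   opentsdb  soft  nofile  10000
#   opentsdb  hard  nofile  10000
```

Note that a process inherits its limits at startup, so the TSD has to be restarted after the change takes effect for its user.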

David Yu

Feb 18, 2016, 6:51:56 PM
to Jonathan Creasy, OpenTSDB
I did the following:

$ lsof -a -p 28147 

...

java    28147 opentsdb  391u  sock                0,7      0t0 130995 can't identify protocol

java    28147 opentsdb  392u  sock                0,7      0t0 131043 can't identify protocol

java    28147 opentsdb  393u  sock                0,7      0t0 131061 can't identify protocol

java    28147 opentsdb  394u  sock                0,7      0t0 131027 can't identify protocol

java    28147 opentsdb  395u  sock                0,7      0t0 132157 can't identify protocol

java    28147 opentsdb  396u  sock                0,7      0t0 132103 can't identify protocol

java    28147 opentsdb  397u  sock                0,7      0t0 132121 can't identify protocol

java    28147 opentsdb  398u  sock                0,7      0t0 132139 can't identify protocol

java    28147 opentsdb  399u  sock                0,7      0t0 132196 can't identify protocol

...


There are a whole bunch of these open.

Any chance TSDB is leaking file descriptors? Or could it be related to how our HTTP client queries the OpenTSDB server (e.g., not closing the Response)?
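One way to tell a leak from a steady state is to watch whether the descriptor count keeps climbing under constant load. A minimal sketch (the PID here is a placeholder; substitute the TSD's PID, e.g. 28147 from the lsof output above):

```shell
pid=$$   # placeholder: substitute the TSD PID, e.g. 28147
# Break open descriptors down by type (sock, REG, IPv4, ...) if lsof is present
command -v lsof >/dev/null && lsof -a -p "$pid" | awk 'NR>1 {print $5}' | sort | uniq -c | sort -rn
# Total open descriptors for the process; if this number climbs steadily
# while traffic stays flat, something is leaking
ls /proc/"$pid"/fd | wc -l
```

Sampling the `/proc/<pid>/fd` count every few minutes makes the trend obvious without the overhead of a full lsof run.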


Jonathan Creasy

Feb 18, 2016, 8:00:37 PM
to David Yu, OpenTSDB

Are your clients connecting, sending a tiny bit, and disconnecting?

Or do they connect, stay connected, stream lots-o-metrics?

David Yu

Feb 18, 2016, 8:06:39 PM
to Jonathan Creasy, OpenTSDB
The client just does simple queries via POST. No writes:

    public List<OpenTsdbResult> queryOpenTsdb(OpenTsdbQuery query) throws WebApplicationException {
        Response response = rootWebTarget.path(QUERY_URL)
                .request().post(Entity.entity(query, MediaType.APPLICATION_JSON));

        if (response.getStatus() != Response.Status.OK.getStatusCode()) {
            throw new WebApplicationException(response);
        }
        return response.readEntity(new GenericType<List<OpenTsdbResult>>() {});
    }



Jonathan Creasy

Feb 18, 2016, 9:01:39 PM
to David Yu, OpenTSDB

How do the writes happen?

Other nodes I guess?

I will look and see if we have a similar symptom. Our default ulimit is 10k, so we might not have noticed it.

ManOLamancha

Feb 18, 2016, 11:40:14 PM
to OpenTSDB, jona...@ghostlab.net
On Thursday, February 18, 2016 at 5:06:39 PM UTC-8, David Yu wrote:
The client just does simple queries via POST. No writes:

    public List<OpenTsdbResult> queryOpenTsdb(OpenTsdbQuery query) throws WebApplicationException {
        Response response = rootWebTarget.path(QUERY_URL)
                .request().post(Entity.entity(query, MediaType.APPLICATION_JSON));

        if (response.getStatus() != Response.Status.OK.getStatusCode()) {
            throw new WebApplicationException(response);
        }
        return response.readEntity(new GenericType<List<OpenTsdbResult>>() {});
    }

Yeah, it sounds like the client may have stopped without disconnecting properly. If you run "netstat -an | grep <TSDB PORT>" I bet you'll see a ton of TIME_WAIT connections. OpenTSDB will close idle connections after so many seconds (I need to look it up); otherwise the OS will close them after hours go by. So if you can find out what your "rootWebTarget" client is doing and make sure to either use keep-alive or call .close() on it before exiting the JVM or applet or whatever, that would help. Also, like Jonathan said, check your write utilities, as they may be leaving connections open on exit. (I've seen that a lot with Perl scripts.)
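The state check described above can be scripted. A sketch, assuming OpenTSDB's default listening port 4242 (substitute yours if it differs):

```shell
# Count connections on the TSD port by TCP state
# (LISTEN, ESTABLISHED, TIME-WAIT, CLOSE-WAIT, ...)
ss -tan '( sport = :4242 or dport = :4242 )' | awk 'NR>1 {print $1}' | sort | uniq -c
```

A large, steady pile of CLOSE-WAIT entries on the server side would point at the TSD holding sockets that the peer has already closed, which is exactly the pattern that eats file descriptors.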

David Yu

Feb 19, 2016, 2:18:53 AM
to ManOLamancha, OpenTSDB, Jonathan Creasy
On my client side, everything looks healthy. I issued the "netstat" command on the client machines, and all of them have just a handful of CLOSE_WAIT connections (which should eventually transition to CLOSED).

On the OpenTSDB side, however, I'm seeing thirty-some CLOSE_WAIT connections. These OpenTSDB servers have been taken off the ELB (for investigation), so they receive no incoming connections. The CLOSE_WAIT connections therefore appear to be stuck in that state.

Our clients do seem to get exceptions while sending post requests:

java.net.SocketException: Unexpected end of file from server
    at org.glassfish.jersey.client.HttpUrlConnector
    ....

This is thrown before closing the connections. 

At this point, we only serve static data from OpenTSDB, which we bulk imported. So no writes to the server at all.

ManOLamancha

Feb 19, 2016, 7:17:11 PM
to OpenTSDB, clars...@gmail.com, jona...@ghostlab.net
On Thursday, February 18, 2016 at 11:18:53 PM UTC-8, David Yu wrote:
On my client side, everything looks healthy. I issued the "netstat" command on the client machines, and all of them have just a handful of CLOSE_WAIT connections (which should eventually transition to CLOSED).

On the OpenTSDB side, however, I'm seeing thirty-some CLOSE_WAIT connections. These OpenTSDB servers have been taken off the ELB (for investigation), so they receive no incoming connections. The CLOSE_WAIT connections therefore appear to be stuck in that state.

Our clients do seem to get exceptions while sending post requests:

java.net.SocketException: Unexpected end of file from server
    at org.glassfish.jersey.client.HttpUrlConnector
    ....

This is thrown before closing the connections. 

At this point, we only serve static data from OpenTSDB, which we bulk imported. So no writes to the server at all.

Hmm, I wonder if ELB is doing something weird then. Another thing to check is the sockets to HBase: make sure the region servers aren't falling over and coming back online all the time. Also, maybe try hitting one of the TSDs directly, bypassing the ELB, to see if that helps at all. It does sound fishy that Jersey is throwing "Unexpected end of file from server". One more thing to check would be to tcpdump the traffic to your Jersey app and see if all of the data is indeed making it back.
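Both suggestions can be checked from the TSD host. A sketch, assuming the HBase 1.x default regionserver port 16020 (older releases used 60020) and the default TSD port 4242; adjust for your cluster:

```shell
# How many established connections to the HBase regionservers? If this
# number bounces around between samples, the regionservers may be
# restarting or dropping connections.
ss -tan state established '( dport = :16020 or dport = :60020 )' | tail -n +2 | wc -l

# To verify complete responses reach the Jersey client, capture the TSD
# traffic (requires root) and inspect the dump afterwards:
#   tcpdump -nn -i any port 4242 -w tsd.pcap
```

Comparing the capture against the client's "Unexpected end of file" timestamps would show whether the server (or the ELB in between) is truncating responses.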