High number of network connections from APP servers to a 3-node Cassandra cluster


Dongfeng Lu

unread,
Nov 16, 2015, 3:28:10 PM11/16/15
to DataStax Java Driver for Apache Cassandra User Mailing List
We have an AWS environment with 2 APP servers and a 3-node Cassandra cluster. Our application had been running well for almost 3 weeks (as far as we can tell from the logs) after a deployment of new code. However, starting Nov. 8 it began to misbehave: after working for a while, it generates the following exception for every request.

com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)
        at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:65)
        at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:257)
        at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:173)
        at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)

What we have observed is that the number of connections from either of the 2 APP servers to the Cassandra nodes (cass1, cass2, cass3) is very large, for example:

$ netstat | grep -c -i cass
4007

When we restart the application, it works fine, and the number of connections starts at a small value, around 200. However, it steadily climbs, and once it reaches some limit the application starts to continuously report NoHostAvailableException errors.

We have a similar setup in 2 other AWS environments and 2 other private networks, and they are all running fine. In one private environment with a much heavier load, the number of connections from the APP server to the Cassandra cluster stays steady at around 500. Of course, the app server in that private environment is more powerful (8 cores, 16 GB memory) than those in this AWS environment (2 cores, 8 GB memory), but given the load difference that seems reasonable.

So what could cause this high number of connections? I believe I am using a single Session object to handle all requests, but I do have 10 - 20 threads sharing that Session object. Could that cause such a high number? How do I check the number of network connections using the Java driver? And how do I check how many Session objects I have for the cluster?
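
For anyone looking to answer the "how do I check" questions from the driver side: the driver exposes its own view of the pools. A minimal sketch (requires the driver jar on the classpath; note that Session.getState() was only introduced in driver 2.0.9, so it is not available on 2.0.8, while Cluster.getMetrics() is available throughout 2.0.x unless metrics are disabled):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;
import com.datastax.driver.core.Session;

public class PoolInspector {
    // Logs how many connections the driver itself believes it holds
    // open per host, plus the total from the driver's metrics.
    static void logPoolState(Cluster cluster, Session session) {
        Session.State state = session.getState();
        for (Host host : state.getConnectedHosts()) {
            System.out.printf("%s: open=%d, inFlight=%d%n",
                    host.getAddress(),
                    state.getOpenConnections(host),
                    state.getInFlightQueries(host));
        }
        // Total open connections across all sessions of this Cluster.
        System.out.println("total open connections: "
                + cluster.getMetrics().getOpenConnections().getValue());
    }
}
```

Comparing this number against netstat tells you whether the extra sockets are ones the driver is tracking or leaked/half-closed ones it has already forgotten about.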

We are using Cassandra 2.0.6 and Java Driver 2.0.8. I saw https://datastax-oss.atlassian.net/browse/JAVA-425, but I am not sure whether upgrading to 2.0.10 will solve the issue, as 2.0.8 has been working for us in other environments.

Any help is appreciated.

Olivier Michallat

unread,
Nov 18, 2015, 7:00:22 AM11/18/15
to java-dri...@lists.datastax.com
Hi,

This looks a lot like JAVA-419. Does your connection pool have a variable size, i.e. core connections != max connections? (If you haven't customized anything in PoolingOptions, the answer is yes.) In driver versions before 2.0.10, there is a bug where the pool goes into a loop, opening and immediately closing a connection, which leads to the high connection count.

This was fixed in 2.0.10. If you upgrade, I'd recommend using the latest version, 2.0.12.

If for some reason you can't upgrade, the workaround is to force the pool to a fixed size, for example:

    Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1")  // your contact points here
        .withPoolingOptions(new PoolingOptions()
            .setCoreConnectionsPerHost(HostDistance.LOCAL, 8)
            .setMaxConnectionsPerHost(HostDistance.LOCAL, 8)
            .setCoreConnectionsPerHost(HostDistance.REMOTE, 2)
            .setMaxConnectionsPerHost(HostDistance.REMOTE, 2))
        .build();




--

Olivier Michallat

Driver & tools engineer, DataStax


--
You received this message because you are subscribed to the Google Groups "DataStax Java Driver for Apache Cassandra User Mailing List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to java-driver-us...@lists.datastax.com.

Dongfeng Lu

unread,
Nov 19, 2015, 11:35:57 PM11/19/15
to DataStax Java Driver for Apache Cassandra User Mailing List
Hi Olivier,

Thank you very much. It is similar to JAVA-419, but our connections are all ESTABLISHED, with few in TIME_WAIT. I am not sure if that makes any difference. Yes, we are using the default PoolingOptions.
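
For anyone reproducing this kind of diagnosis, the ESTABLISHED vs TIME_WAIT split can be tallied from netstat output programmatically rather than eyeballed. A minimal sketch (the sample netstat lines below are made up for illustration; real output has more columns but the state is still last):

```java
import java.util.Map;
import java.util.TreeMap;

public class ConnStates {
    // Given raw `netstat -n` output, count lines per TCP state for
    // connections involving the given peer (here, Cassandra's port).
    public static Map<String, Integer> countStates(String netstatOutput, String peer) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : netstatOutput.split("\n")) {
            if (!line.contains(peer)) continue;
            String[] cols = line.trim().split("\\s+");
            String state = cols[cols.length - 1];  // TCP state is the last column
            counts.merge(state, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample of netstat output.
        String sample =
              "tcp 0 0 10.0.0.5:41000 10.0.0.10:9042 ESTABLISHED\n"
            + "tcp 0 0 10.0.0.5:41001 10.0.0.10:9042 ESTABLISHED\n"
            + "tcp 0 0 10.0.0.5:41002 10.0.0.10:9042 TIME_WAIT\n";
        System.out.println(countStates(sample, ":9042"));
        // → {ESTABLISHED=2, TIME_WAIT=1}
    }
}
```

A high ESTABLISHED count points at the pool genuinely holding connections, while a high TIME_WAIT count points at rapid open/close churn like JAVA-419.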

We did two things in this AWS environment. We first resized the instance from m1.large to m3.xlarge. This alone seems to have reduced the number of network connections to a much lower level, around 200, for about 20 hours. We then deployed the app with Java Driver 2.0.10, and the number of network connections stayed quite low, with ESTABLISHED fluctuating around 91 and about 20 - 100 in TIME_WAIT from time to time. Note that I also modified the app to wait half a second after each exception to give the network a break, which might also have helped reduce the number of connections.

However, I still see a lot of NoHostAvailableExceptions, averaging 50 per hour. In addition, I also see a lot of "Error creating pool to" errors, averaging about 71 per hour, which implies that the connection pool is closed and re-created very often. I don't remember seeing this error with 2.0.8. Is it new in 2.0.10? Does this mean there is still a network problem?

In a separate AWS environment, I deployed both the old app with 2.0.8 and the modified app with 2.0.10 on the same machine and turned on TRACE logging. Both apps connect to the same Cassandra cluster (a different one from the environment above). After they had both run for a day, I saw the following:

    Metric (per Java Driver version)                  2.0.8    2.0.10
    ----------------------------------------------------------------
    Network connections
      (netstat -p | grep -i ndb | grep -c PID)         3382        91
    NoHostAvailableException / hour                     182      2370
    "Error creating pool" / hour                          0         0
    Connections opened / hour (counting "Transport
      initialized and ready" in 2.0.8, "Connection
      opened successfully" in 2.0.10)                  1633      1900
    Connections closed / hour                          1898      1421


The numbers of network connections are in line with what we discussed above. However, the number of NoHostAvailableExceptions is about 10 times higher for 2.0.10 than for 2.0.8. I don't know if that makes sense, since 2.0.10 is supposed to fix the NoHostAvailableException problem. I also looked at the opening and closing of network connections, and the two versions are comparable, at about 1700 times per hour. I am not a network expert, but that number seems too high.

In the other private environments with heavier loads, I have not observed any NoHostAvailableExceptions over these past couple of days.

Is AWS networking really that bad? What options/values should I set for this kind of environment? Since the connection issue seems to be transient, I plan to catch the exception and retry the queries after a small delay so that at least the app can finish the queries correctly. Any other suggestions?
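
The catch-and-retry-after-a-delay plan can be sketched generically. A minimal sketch (RetryHelper and the parameter choices are illustrative, not part of the driver; in the real app you would catch NoHostAvailableException specifically, which extends RuntimeException, rather than every RuntimeException as done here):

```java
import java.util.concurrent.Callable;

public class RetryHelper {
    // Runs op, retrying up to maxAttempts times with delayMs between
    // attempts; rethrows the last failure if all attempts fail.
    public static <T> T retry(Callable<T> op, int maxAttempts, long delayMs)
            throws Exception {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) Thread.sleep(delayMs);
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulated query that fails twice before succeeding.
        String result = retry(() -> {
            calls[0]++;
            if (calls[0] < 3) throw new RuntimeException("no host available");
            return "ok";
        }, 5, 500);
        System.out.println(result + " after " + calls[0] + " attempts");
        // → ok after 3 attempts
    }
}
```

A fixed half-second delay is the simplest choice; exponential backoff would be gentler on the cluster if the outage lasts longer than one retry interval.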

uskratos

unread,
Nov 23, 2015, 10:37:52 PM11/23/15
to DataStax Java Driver for Apache Cassandra User Mailing List
I am facing a similar situation. Running a load test immediately generates the following error:

All host(s) tried for query failed (tried: /10.82.51.72:9042 (com.datastax.driver.core.TransportException: [/10.82.51.72:9042] Connection has been closed), /10.82.51.73:9042 (com.datastax.driver.core.TransportException: [/10.82.51.73:9042] Connection has been closed), /10.82.51.71:9042 (com.datastax.driver.core.TransportException: [/10.82.51.71:9042] Connection has been closed)).

I am using a 3-node cluster, all on the same ESXi host, so network latency is not an issue (avg. 0.186 ms).

My connection is built as follows:

CodecRegistry codecRegistry = new CodecRegistry();
DCAwareRoundRobinPolicy fapolicy = new Builder().withLocalDc(aps.DC).build();

nosql_cl = Cluster.builder()
        .addContactPointsWithPorts(aps.CSClusterNodes)
        .withRetryPolicy(DefaultRetryPolicy.INSTANCE)
        .withProtocolVersion(ProtocolVersion.NEWEST_SUPPORTED)
        .withPoolingOptions(new PoolingOptions()
                .setCoreConnectionsPerHost(HostDistance.LOCAL, 48)
                .setMaxConnectionsPerHost(HostDistance.LOCAL, 48)
                .setCoreConnectionsPerHost(HostDistance.REMOTE, 24)
                .setMaxConnectionsPerHost(HostDistance.REMOTE, 24))
        .withCodecRegistry(codecRegistry)
        .withReconnectionPolicy(new ConstantReconnectionPolicy(5000))
        .withLoadBalancingPolicy(new TokenAwarePolicy(fapolicy))
        .withCredentials(aps.CSU, aps.CSP)
        .build();


Cassandra version: 3.0.0
Java driver version: 3.0.0-beta1

On average I see about 192 connections per node on port 9042 while the test is running. Should I lower the MaxConnectionsPerHost setting?

Alex Popescu

unread,
Nov 24, 2015, 12:37:40 AM11/24/15
to java-dri...@lists.datastax.com
What led you to choosing those numbers of connections per host?


--
You received this message because you are subscribed to the Google Groups "DataStax Java Driver for Apache Cassandra User Mailing List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to java-driver-us...@lists.datastax.com.



--
Bests,

Alex Popescu | @al3xandru
Sen. Product Manager @ DataStax

uskratos

unread,
Nov 24, 2015, 10:03:34 AM11/24/15
to DataStax Java Driver for Apache Cassandra User Mailing List
Nothing in particular, as I haven't seen a recommended default for either CoreConnectionsPerHost or MaxConnectionsPerHost. My initial tests used 100 connections, so setting the value to 48 × 3 Cassandra hosts would give a higher capacity than my target of 100. Yet it was still failing. I increased my load test to 200 connections; the result is the same. It works fine if I only have one connection. I can "play" with 5, 10, or 50 and see if I still get the same error.

Alex Popescu

unread,
Nov 25, 2015, 5:14:01 AM11/25/15
to java-dri...@lists.datastax.com
On Tue, Nov 24, 2015 at 7:03 AM, uskratos <uskr...@gmail.com> wrote:
Nothing in particular as I haven't seen a default recommended value for either CoreConnectionsPerHost or MaxConnectionsPerHost. My initial tests were using 100 connections, so setting the value to 48 X 3 cassandra hosts would have resulted in a higher value than my 100 target . Yet it was still failing. I increased my load test to 200 connections. The result is the same. It works fine if I only have one connection. I can "play" with 5, or 10 or 50 and see if I still get the same error.


PoolingOptions already contains default values for both core and max connections. I'd suggest starting with those and tuning later based on your findings.
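
One quick way to see what those defaults actually are in your driver version is to construct a fresh PoolingOptions and print its values. A minimal sketch (requires the driver jar on the classpath; the exact numbers vary by driver and native protocol version, so none are asserted here):

```java
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;

public class PoolDefaults {
    public static void main(String[] args) {
        // A freshly constructed PoolingOptions carries the driver's
        // built-in defaults until you override them.
        PoolingOptions defaults = new PoolingOptions();
        System.out.printf("LOCAL core=%d max=%d, REMOTE core=%d max=%d%n",
                defaults.getCoreConnectionsPerHost(HostDistance.LOCAL),
                defaults.getMaxConnectionsPerHost(HostDistance.LOCAL),
                defaults.getCoreConnectionsPerHost(HostDistance.REMOTE),
                defaults.getMaxConnectionsPerHost(HostDistance.REMOTE));
    }
}
```

With protocol v3 and later, each connection can multiplex many thousands of in-flight requests, so the defaults are far smaller than 48 per host.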