Experiencing strange disconnect issue


Bo Finnerup Madsen

Mar 17, 2016, 2:47:31 AM
to DataStax Java Driver for Apache Cassandra User Mailing List
Hi,

We are currently trying to convert an existing Java web application to use Cassandra, and while most of it works great :) we have one "small" issue.

After some time, all connectivity seems to be lost and we get the following errors:
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.61.70.107:9042 (com.datastax.driver.core.exceptions.TransportException: [/10.61.70.107] Connection has been closed), /10.61.70.108:9042 (com.datastax.driver.core.exceptions.TransportException: [/10.61.70.108] Connection has been closed))

com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.61.70.107:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /10.61.70.108:9042 (com.datastax.driver.core.exceptions.TransportException: [/10.61.70.108] Connection has been closed), /10.61.70.110:9042 (com.datastax.driver.core.exceptions.TransportException: [/10.61.70.110] Connection has been closed))

The errors persist, and the application needs to be restarted to recover.

At application startup we create a cluster and a session, which we reuse throughout the application as per the documentation. We use the following code:
final Cluster cluster = Cluster.builder()
    .addContactPoints(key.getContactPoints())
    .withSocketOptions(new SocketOptions().setKeepAlive(true))
    .withReconnectionPolicy(Policies.defaultReconnectionPolicy())
    .withRetryPolicy(Policies.defaultRetryPolicy())
    .build();
final Session session = cluster.connect(key.getKeyspace());
We have tried both with and without the socket options/policies above...same result.

I have uploaded the cassandra driver debug log here: https://gist.github.com/anonymous/db6fa061298018c46954
It all goes south at about line 433 (time 19:46:43) where the client throws a:
io.netty.handler.codec.DecoderException: com.datastax.driver.core.exceptions.DriverInternalError: Adjusted frame length exceeds 268435456: 462591744 - discarded

We are running the Cassandra tarball in EC2 in a cluster of three machines. We have tried it with Cassandra 3.0.3 and Java driver 3.0.0, as well as Cassandra 2.1.13 and driver 2.1.9.

I would greatly appreciate any ideas as to what we might be doing wrong here. :)

Thank you in advance!

Yours sincerely,
  Bo Madsen

Kant Kodali

Mar 17, 2016, 3:02:15 AM
to java-dri...@lists.datastax.com
First of all, you don't need all that default stuff; it's already done for you internally.

Secondly, keep it really simple just to verify it is working. Specifically, try something like this:

Cluster cluster = Cluster.builder()
    .addContactPoint(configProperties.getProperty("host"))
    .withPort(Integer.parseInt(configProperties.getProperty("port")))
    .build(); // the default port is 9042
final Session session = cluster.connect("keyspace name");

Finally, this may not fix your problem, but it is still a good first step to see if it connects. Looking at your stack trace, there are a lot of reasons this can happen. Are you using any async stuff?






Bo Finnerup Madsen

Mar 17, 2016, 3:20:55 AM
to java-dri...@lists.datastax.com
Hi Kant,

Thank you for the quick reply :)

We started with the vanilla config you describe, but we were advised on the Cassandra server mailing list to try the policy stuff. The result is the same with and without.
The application starts up fine, connects to the cluster and starts to load data into it. Then after some time (it seems a bit random) it loses connectivity.

We use quite a bit of executeAsync, could that be the culprit? I can try doing it all synchronously and see if that changes anything.

Kant Kodali

Mar 17, 2016, 3:28:52 AM
to java-dri...@lists.datastax.com
connectAsync seems to have some issues for me as well, but executeAsync is fine as long as you know how to use it correctly. NoHostAvailableException is pretty common with async code when people don't use it correctly. If you can change everything to synchronous, I would strongly advise doing so (there is a very high chance your problem will go away). Let me know how it goes.


David Hayek

Mar 17, 2016, 8:32:24 AM
to java-dri...@lists.datastax.com
Interesting. We experienced something similar in Google's cloud. It was only after a period of inactivity that the connections were lost. We ultimately had to change some low-level OS keep-alive values between the nodes (I don't know the details offhand; not my area of expertise). Here's a link that might help.




Vishy Kasaravalli

Mar 17, 2016, 12:04:11 PM
to java-dri...@lists.datastax.com
Bo,

The following line from your log indicates that there is no schema agreement in your Cassandra cluster. This means not all Cassandra nodes have the same view of your tables.

[20160316 19:00:04 DEBUG ] ControlConnection - Checking for schema agreement: versions are [71474df1-06ab-3373-af5e-9d1e3f3b737e, ecbe9448-afa8-3687-8a66-f6f665a5e731]

Under this condition, the data sent back by some nodes does not make sense to the client driver. The line below is a symptom of that.

io.netty.handler.codec.DecoderException: com.datastax.driver.core.exceptions.DriverInternalError: Adjusted frame length exceeds 268435456: 462591744 - discarded

The fix is to ensure schema agreement. See this: http://wiki.apache.org/cassandra/FAQ#schema_disagreement
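[Editor's note: in the Java driver you can verify agreement programmatically with cluster.getMetadata().checkSchemaAgreement() after issuing DDL statements. The sketch below shows the underlying poll-until-agreement idea in plain Java; the versionsSeen supplier is a hypothetical stand-in for reading the distinct schema versions from system.local/system.peers, not driver API.]

```java
import java.util.Set;
import java.util.function.Supplier;

public class SchemaAgreementCheck {
    // Poll until every node reports the same schema version, or give up after
    // timeoutMs. versionsSeen is a hypothetical stand-in for the set of
    // distinct schema versions currently reported by the cluster's nodes.
    public static boolean waitForAgreement(Supplier<Set<String>> versionsSeen,
                                           long timeoutMs, long pollMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (versionsSeen.get().size() <= 1) {
                return true; // a single distinct version: the cluster agrees
            }
            try {
                Thread.sleep(pollMs); // wait before re-checking
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false; // timed out while nodes still disagreed
    }
}
```

Blocking writes until this returns true after each CREATE TABLE would avoid racing ahead of a cluster that has not yet converged.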

Bo Finnerup Madsen

Mar 18, 2016, 2:46:16 AM
to java-dri...@lists.datastax.com
Hi Kant,

Switching from executeAsync to plain execute fixed the issue :) Thank you!

However, I am not sure I understand why...could it be a data size issue, where we are trying to execute too many statements simultaneously?
The application is writing a lot of data to the cluster, and the cluster nodes are quite loaded at the time (load > 10).

Bo Finnerup Madsen

Mar 18, 2016, 2:54:56 AM
to java-dri...@lists.datastax.com
Hi David,

Thank you for chiming in :)

Idle disconnects were also my initial suspicion after googling the errors I got, so I updated all the OS keep-alive settings according to the DataStax manual.

But it is my understanding that there is no firewall between the nodes in EC2, so in theory there should be nothing to drop idle connections. Also, the issue happens while the application is actively writing data to the cluster.

Bo Finnerup Madsen

Mar 18, 2016, 3:02:04 AM
to java-dri...@lists.datastax.com
Hi Vishy,

Thank you for sharing your opinion! I am really grateful for all the helpful people on this list :)

Cassandra is very new to me, but I have executed both "nodetool status" and "nodetool describecluster" many times during these experiments, and I have never seen the schema split you describe.

When the application is started, it creates its keyspace and associated tables. Then it proceeds to synchronize data into these tables from several external systems without errors. Then, between 30 minutes and 2 hours into the synchronization, we get the error described.
Is it possible that the cluster stops agreeing on the schema that long after the tables were created?

Kant Kodali

Mar 18, 2016, 3:08:30 AM
to java-dri...@lists.datastax.com
There you go!! With executeAsync there seems to be lock contention. By the time the callback wants to write the data back, there is no connection, hence the NoHostAvailableException!

Kant Kodali

Mar 18, 2016, 3:13:51 AM
to java-dri...@lists.datastax.com
Sorry, I meant connectAsync (that is where there is lock contention). Since you use connect and executeAsync, there is a chance you are not using the callbacks correctly, and since you changed everything to sync your problem disappeared. Once again, executeAsync should work fine; it is very likely something is wrong with your callback structure.


Bo Finnerup Madsen

Mar 18, 2016, 5:14:45 AM
to java-dri...@lists.datastax.com
Hi Kant,

That might very well be the case :)

We are rewriting an existing application to use Cassandra, so we use executeAsync where we would normally use batch inserts/updates.
Specifically, we have the following procedures:
public static List<ResultSet> executeAsyncAndWait(Session session, Stream<? extends Statement> statements) {
    try {
        List<ResultSetFuture> operations = Lists.newArrayList();
        statements.forEach(st -> operations.add(session.executeAsync(st)));
        return waitForCompletion(operations);
    } catch (RuntimeException ex) {
        throw new RuntimeException("Got exception querying cassandra", ex);
    }
}

public static List<ResultSet> waitForCompletion(List<ResultSetFuture> operations) {
    try {
        return Futures.allAsList(operations).get();
    } catch (InterruptedException e) {
        throw new RuntimeException(e);
    } catch (ExecutionException e) {
        throw Throwables.propagate(e.getCause());
    }
}

While I can imagine that this is not the way executeAsync was meant to be used, I cannot see anything obviously wrong.


Kant Kodali

Mar 18, 2016, 5:52:03 AM
to java-dri...@lists.datastax.com
Okay, this code definitely does not look right to me, and hence your problem!!

First of all, here is some theory: a Future represents a computation in progress, so doing Future.get() doesn't guarantee that the computation is done; therefore you always need to do .get() inside a loop checking .isDone() when you use the standard Java APIs. I can see that you are using Futures from Google's Guava library, however you are not using it right.

Futures.allAsList(operations).get(); // This is indeed your big problem

Now, this returns a ListenableFuture, but you are not making the best use of it. If you want to do .get() you still need to wrap it inside a loop because ListenableFuture extends Future, but there is a much better way, which requires you to refactor a good amount of your code. I would recommend getting something like ListenableFuture<ResultSet> (or ListenableFuture<Whatever>) rather than ResultSetFuture, and then you can do something like:

Futures.addCallback(listenableFuture, new FutureCallback<ResultSet>() {
    public void onSuccess(ResultSet result) { /* do whatever */ }
    public void onFailure(Throwable t) { /* handle the failure */ }
});
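[Editor's note: a common cause of the symptoms in this thread is issuing executeAsync in an unbounded loop, which can exhaust the driver's per-host connections and request slots. A frequently recommended remedy is to cap the number of in-flight requests with a Semaphore. The sketch below shows that pattern in plain Java, with CompletableFuture and a generic Supplier standing in for session.executeAsync(statement), so it runs without a Cassandra cluster.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

public class ThrottledAsync {
    // Run tasks asynchronously while capping how many are in flight at once.
    // In driver code the task would be session.executeAsync(statement); a
    // plain Supplier stands in here so the sketch is self-contained.
    public static <T> List<T> runBounded(ExecutorService pool,
                                         List<Supplier<T>> tasks,
                                         int maxInFlight) {
        Semaphore permits = new Semaphore(maxInFlight);
        List<CompletableFuture<T>> futures = new ArrayList<>();
        for (Supplier<T> task : tasks) {
            try {
                permits.acquire(); // block submission until a slot frees up
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
            CompletableFuture<T> f = CompletableFuture.supplyAsync(task, pool);
            f.whenComplete((result, error) -> permits.release()); // free the slot
            futures.add(f);
        }
        List<T> results = new ArrayList<>();
        for (CompletableFuture<T> f : futures) {
            results.add(f.join()); // join() rethrows failures as CompletionException
        }
        return results;
    }
}
```

With this shape, the producer can never get more than maxInFlight requests ahead of the server, which is the property the unbounded Futures.allAsList approach lacks.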





