Java driver (2.1.4) did not recover from bad connection

Jose Fernandez

Jul 17, 2015, 3:31:28 PM
to java-dri...@lists.datastax.com
Hi all,

We have an application running in 12 different hosts. Each one uses a single instance of the Java driver to write to our Cassandra cluster.

Earlier this week something happened in our cluster that made the drivers in 4 of the 12 boxes start spewing the following errors:

com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency ONE (1 replica were required but only 0 acknowledged the write)

We did a rolling restart of the entire Cassandra cluster, but that didn't help. The issue only went away when we restarted the application and refreshed the client session.

This is the first time we've encountered this issue. We're using the 2.1.4 version of the client and haven't updated it in a while. Since this only happened on some of the app servers, we think this is an issue where the client's session went stale but the driver wasn't able to detect this and refresh it. Does this sound like a bug that might have been fixed in the newest releases?

Andrew Tolbert

Jul 17, 2015, 4:46:57 PM
to java-dri...@lists.datastax.com
Hi Jose,

> com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency ONE (1 replica were required but only 0 acknowledged the write)

This is a timeout surfaced from Cassandra to the client. It indicates that the Cassandra coordinator could not get an acknowledgement from any of the replicas owning that data within write_request_timeout_in_ms.
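
For what it's worth, the exception itself carries the consistency level and the ack counts, so it can be logged with a bit more context when caught. A minimal sketch against the 2.1 driver; the executeWithLogging wrapper and its arguments are just placeholders:

    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.Statement;
    import com.datastax.driver.core.exceptions.WriteTimeoutException;

    // Sketch: execute a write and log the details the exception exposes.
    void executeWithLogging(Session session, Statement statement) {
        try {
            session.execute(statement);
        } catch (WriteTimeoutException e) {
            // Consistency level the coordinator was aiming for, plus how many
            // replica acks it received vs. how many it required before
            // write_request_timeout_in_ms expired.
            System.err.printf("Write timeout: cl=%s, received=%d, required=%d, writeType=%s%n",
                    e.getConsistencyLevel(),
                    e.getReceivedAcknowledgements(),
                    e.getRequiredAcknowledgements(),
                    e.getWriteType());
            throw e;
        }
    }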

When you were getting the WriteTimeoutExceptions, were they all coming from the same Cassandra host, or were they pretty evenly distributed across your cluster?

One possibility I'm thinking of is that maybe your clients were directing a large portion of their connections to a single C* node or set of C* nodes.  What load balancing policy do you have configured?
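
For reference, the load balancing policy is set when the Cluster is built. A minimal sketch with the 2.1 driver, using a token-aware, DC-aware round-robin policy; the contact point and the "DC1" data center name are placeholders:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
    import com.datastax.driver.core.policies.TokenAwarePolicy;

    // Sketch: build a Cluster with an explicit load balancing policy so that
    // requests are spread round-robin across the local DC and routed to replicas.
    Cluster cluster = Cluster.builder()
            .addContactPoint("10.0.0.1")  // placeholder contact point
            .withLoadBalancingPolicy(
                    new TokenAwarePolicy(new DCAwareRoundRobinPolicy("DC1")))
            .build();

If no policy is set explicitly, the 2.1 driver defaults to a token-aware wrapper around a DC-aware round-robin policy, if I remember correctly.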

As far as newer releases helping, the first thing that comes to mind is the revert of JAVA-425 (done in JAVA-617, shipped in 2.1.6): with that mechanism in place, hosts can be marked down and never marked back up because executor tasks are blocked waiting to resolve the suspect state. That doesn't seem likely in this case, though; I would expect less activity and no timeouts if that were happening. We have also done some work since 2.1.4 on optimizing pool establishment and such, but I'm not confident any of it would resolve your problem.
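
One quick way to check whether the driver itself thinks hosts are down (which is roughly what a stuck suspect state would look like from the client side) is to poll the cluster metadata. A small sketch, assuming you have a reference to your Cluster instance; the logHostStates helper name is made up:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Host;

    // Sketch: log each host the driver knows about and whether it is currently
    // considered up. Useful for spotting hosts that stay marked down on the
    // client even though the Cassandra side looks healthy.
    void logHostStates(Cluster cluster) {
        for (Host host : cluster.getMetadata().getAllHosts()) {
            System.out.printf("%s (dc=%s, rack=%s): up=%b%n",
                    host.getAddress(), host.getDatacenter(), host.getRack(), host.isUp());
        }
    }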

> We did a rolling restart of the entire Cassandra cluster, but that didn't help. The issue only went away when we restarted the application and refreshed the client session.

How did you go about restarting your applications? When you restarted one, was it only that one that recovered, or did you restart them all at the same time? It's possible that a temporary reprieve in load from your application gave your C* cluster enough time to recover from whatever high-load state it was in.

Thanks!
Andy

Jose Fernandez

Jul 17, 2015, 4:59:32 PM
to java-dri...@lists.datastax.com
Andrew,

Our exception logging (New Relic) didn't include the Cassandra host, just the message and stack trace. Is this metadata available in the exception itself? If so, I could modify our error handling to append that information to the error message.

I'm using the default load balancing policy (round robin?).

I restarted each box one at a time. The errors disappeared as soon as the box came back up and the others didn't seem to be affected until they were restarted too.

Another team at my company said they ran into a similar issue with 2.1.4 and the problem went away on 2.1.6. I was also looking at the revert of 425 that you mentioned.

Andrew Tolbert

Jul 17, 2015, 5:45:35 PM
to java-dri...@lists.datastax.com
Hi Jose,

Ah, you are right.  That information is not surfaced in the exception.  We have an issue open (JAVA-720), targeting the next 2.0.x and 2.1.x releases, so that the host is at least available on the exception (not sure about it being logged, but we probably should log it).
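
In the meantime, one partial workaround is to log the coordinator from the ExecutionInfo of successful requests and keep an eye on the driver's aggregate error counters. Neither ties a timeout to a specific host, but together they can help narrow things down. A sketch, assuming metrics are enabled (the default); the method name and arguments are placeholders:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Session;

    // Sketch: the host that coordinated a successful request is available on the
    // ExecutionInfo, and the driver keeps cluster-wide error counters.
    void logCoordinatorAndErrors(Session session, Cluster cluster, String cql) {
        ResultSet rs = session.execute(cql);
        System.out.println("Coordinator: " + rs.getExecutionInfo().getQueriedHost());

        // Cluster-wide count of write timeouts seen by this driver instance.
        long writeTimeouts = cluster.getMetrics().getErrorMetrics().getWriteTimeouts().getCount();
        System.out.println("Write timeouts so far: " + writeTimeouts);
    }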

> I restarted each box one at a time. The errors disappeared as soon as the box came back up and the others didn't seem to be affected until they were restarted too.

That is good to know; it could mean there is some issue with the driver's view of the host states, in which case 2.1.6 or 2.1.7.1 would be worth a look.

> Another team at my company said they ran into a similar issue with 2.1.4 and the problem went away on 2.1.6. I was also looking at the revert of 425 that you mentioned.

Ah, good to hear that worked out for them.  You can feel pretty safe upgrading to 2.1.6; there aren't any regressions between those versions that we are aware of.

Cheers,
Andy