Counter update write timeouts with Datastax Driver/Native protocol, not with Astyanax/Thrift

Steven

Jun 15, 2016, 12:02:30 PM

to DataStax Java Driver for Apache Cassandra User Mailing List, Eugen Dinca

We have a service that writes to a few legacy (pre-CQL) counter column families. We've been trying to migrate this service from Astyanax to the Datastax Java Driver (version 2.1.10.1). We've been testing the new version in a "shadow" deployment in a production environment, using the same Cassandra cluster as the production version, but writing to a testing-only keyspace.

Occasionally, unlogged batches of counter updates in the same partition will fail with the following error from the coordinator:

com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency ONE (1 replica were required but only 0 acknowledged the write)

We've only observed these errors in the service version that uses the Datastax Driver, not the version that uses Astyanax.

These batches are written with CL=LOCAL_QUORUM, so the CL in the error message doesn't match. This resembles the symptoms of the issue described in

CASSANDRA-10041 "timeout during write query at consistency ONE" when updating counter at consistency QUORUM and 2 of 3 nodes alive

In that issue, the error occurs when a node is abruptly terminated. However, we've also seen the error occur when all Cassandra nodes appeared to be healthy.
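To make the mismatch concrete, here is a minimal sketch of how a client could flag a timeout whose reported consistency level differs from the one requested. It uses only the message text quoted above and plain Java, so it has no driver dependency. The class and method names (`TimeoutMessageCheck`, `reportedConsistency`, `consistencyMismatch`) are hypothetical and not part of the Datastax driver, and the regular expression assumes the message keeps the shape shown in the error above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Datastax driver: it only inspects the
// text of a write-timeout message shaped like the one quoted above.
final class TimeoutMessageCheck {
    // Captures the consistency level plus the required/acknowledged replica counts.
    private static final Pattern WRITE_TIMEOUT = Pattern.compile(
        "consistency (\\w+) \\((\\d+) replica were required but only (\\d+) acknowledged");

    // Returns the consistency level named in the message, or null if the
    // message does not have the assumed shape.
    static String reportedConsistency(String message) {
        Matcher m = WRITE_TIMEOUT.matcher(message);
        return m.find() ? m.group(1) : null;
    }

    // True when the message names a consistency level other than the requested one.
    static boolean consistencyMismatch(String message, String requested) {
        String reported = reportedConsistency(message);
        return reported != null && !reported.equals(requested);
    }

    public static void main(String[] args) {
        String msg = "Cassandra timeout during write query at consistency ONE "
            + "(1 replica were required but only 0 acknowledged the write)";
        System.out.println(reportedConsistency(msg));                  // prints "ONE"
        System.out.println(consistencyMismatch(msg, "LOCAL_QUORUM"));  // prints "true"
    }
}
```

Logging the result of such a check alongside the requested CL would at least show how often the timeouts carry a consistency level other than LOCAL_QUORUM.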

There are a few possible explanations for why the errors only occur with the Datastax driver, but I'm not sure which is correct:

a) There is a problem with how we're using the Datastax Driver to compose batches of counter updates.

b) There is a difference between the Native protocol and Thrift protocol implementations of counter updates, such that the error is reported to native clients but not to Thrift clients.

c) There is a difference between the keyspace/column family definition of the production and testing keyspaces.

d) The Astyanax/Thrift version is getting the error but is ignoring it for some reason.

I doubt (c) is the reason; we've made an effort to ensure that the keyspace and CF configurations are the same. Also, (d) seems unlikely because we've seen other errors (such as unavailable exceptions) reported correctly. So, I'm betting that either (a) or (b) is the reason.

Would someone please suggest which of these explanations is likely to be correct, and what we might do to avoid the problem?

Steven

Jun 15, 2016, 12:04:08 PM

to DataStax Java Driver for Apache Cassandra User Mailing List, eu...@knewton.com

I forgot to mention that we are using Cassandra 2.1.11.

Steven

Jun 27, 2016, 5:40:37 PM

to DataStax Java Driver for Apache Cassandra User Mailing List, eu...@knewton.com

Someone on the Cassandra IRC channel suggested that these timeouts might be related to the following Cassandra bug:

https://issues.apache.org/jira/browse/CASSANDRA-11302

However, I'm not sure whether this is the same issue we're facing. It's not clear why that bug would cause only native protocol requests to time out.

I'd appreciate any help at all on this problem.
