Driver gets wedged and can't connect to any servers

Nate Mueller

unread,

Dec 14, 2016, 6:17:06 PM12/14/16

to DataStax Java Driver for Apache Cassandra User Mailing List

I've got a client with a fairly active connection to a two node ring. Normally works fine, but once or twice a month the driver code gets wedged. It disconnects from all the nodes in the ring and then thinks it can't reconnect to them. It continues to retry and fails. When I restart the process it comes up and connects fine. I don't think the cluster is actually dropping offline at any point. I have several of these services running on different machines on the same subnet and while they all exhibit the issue they don't trigger at the same time.

I assume this isn't very actionable but it happens frequently enough that I should be able to collect more information if I know what would be helpful. Version 3.0.0. Here's the relevant log lines:

WARN [2016-12-11 20:26:50,895] cluster1-connection-reaper-0 - com.datastax.driver.core.Connection Forcing termination of Connection[cassandra-cyanite-ring.ec2.nearbuysystems.com/10.7.162.162:9042-55, inFlight=0, closed=true]. This should not happen and is likely a bug, please report.

WARN [2016-12-11 20:28:06,615] cluster1-connection-reaper-0 - com.datastax.driver.core.Connection Forcing termination of Connection[cassandra-cyanite-ring.ec2.nearbuysystems.com/10.7.162.162:9042-56, inFlight=0, closed=true]. This should not happen and is likely a bug, please report.

ERROR [2016-12-11 20:28:52,695] cluster1-reconnection-0 - com.datastax.driver.core.ControlConnection [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds

ERROR [2016-12-11 20:29:52,587] cluster1-reconnection-0 - com.datastax.driver.core.ControlConnection [Control connection] Cannot connect to any host, scheduling retry in 2000 milliseconds

WARN [2016-12-11 20:30:02,101] cluster1-connection-reaper-0 - com.datastax.driver.core.Connection Forcing termination of Connection[cassandra-cyanite-ring.ec2.nearbuysystems.com/10.7.160.8:9042-52, inFlight=0, closed=true]. This should not happen and is likely a bug, please report.

WARN [2016-12-11 20:31:58,077] cluster1-connection-reaper-0 - com.datastax.driver.core.Connection Forcing termination of Connection[cassandra-cyanite-ring.ec2.nearbuysystems.com/10.7.162.162:9042-58, inFlight=0, closed=true]. This should not happen and is likely a bug, please report.

WARN [2016-12-11 20:32:15,447] cluster2-connection-reaper-0 - com.datastax.driver.core.Connection Forcing termination of Connection[cassandra-cyanite-ring.ec2.nearbuysystems.com/10.7.160.8:9042-38, inFlight=0, closed=true]. This should not happen and is likely a bug, please report.

ERROR [2016-12-11 20:32:47,511] cluster2-reconnection-1 - com.datastax.driver.core.ControlConnection [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds

Andrew Tolbert

unread,

Dec 14, 2016, 10:31:08 PM12/14/16

to DataStax Java Driver for Apache Cassandra User Mailing List

Hi Nate,.

The 'Forcing termination fo Connection' message is a bit odd. This is logged when a cleanup task (the reaper) goes to close any connections that are registered with it that meet the criteria to be cleaned up. This message is only logged when you go to force close a connection and the future isn't done. What's weird here is that the connection is already closed, so the future associated with it should also be complete, so it is strange that isn't. It'd be hard to know why without more logs, but it's likely this is more a symptom of whatever is happening, not the actual issue.

I would consider updating to the latest version of 3.0 (3.0.6). There have been a number of fixes since then, although don't think anything related to this in particular. However, it could be that you are hitting some condition that might not exist anymore.

To collect more info it would be helpful to enable DEBUG logging on 'com.datastax.driver.core.Connection' and 'com.datastax.driver.core.ControlConnection' (or just com.datastax.driver.core in general if don't mind lots of logs :)).

Thanks,
Andy

Nate Mueller

unread,

Dec 14, 2016, 10:42:39 PM12/14/16

to java-dri...@lists.datastax.com

Sounds good. I'll get that deployed this week and respond with more logs when it happens again.

--
You received this message because you are subscribed to the Google Groups "DataStax Java Driver for Apache Cassandra User Mailing List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to java-driver-user+unsubscribe@lists.datastax.com.

--

Nate Mueller - Head of Engineering - RetailNext - 406-356-6283

Nate Mueller

unread,

Dec 21, 2016, 2:40:23 PM12/21/16

to java-dri...@lists.datastax.com

I haven't been able to upgrade the driver yet but I did increase logging verbosity and caught another hang. This is the debug log for those two log topics for the two days the system was running.

cyanite.log.gz

Andrew Tolbert

unread,

Dec 28, 2016, 11:27:39 AM12/28/16

to DataStax Java Driver for Apache Cassandra User Mailing List

Hi Nate,

Thanks for following up! My apologies for the delay in response.

Just curious, is the client or C* nodes under heavy GC or CPU load at the time things or is it possible there is some kind of network interruption when things get wedged? I noticed that heartbeats are failing / timing out and that's likely why the connections are being closed. This in particular is somewhat concerning:

DEBUG [2016-12-17 22:57:07,782] cluster1-nio-worker-2 - com.datastax.driver.core.Connection Connection[cassandra-cyanite-ring.ec2.nearbuysystems.com/10.7.162.162:9042-9, inFlight=0, closed=false] was inactive for 30 seconds, sending heartbeat
DEBUG [2016-12-17 22:57:20,601] cluster1-timeouter-0 - com.datastax.driver.core.Connection Connection[cassandra-cyanite-ring.ec2.nearbuysystems.com/10.7.162.162:9042-9, inFlight=0, closed=true] closing connection
DEBUG [2016-12-17 22:57:20,601] cluster1-timeouter-0 - com.datastax.driver.core.Connection Not terminating Connection[cassandra-cyanite-ring.ec2.nearbuysystems.com/10.7.162.162:9042-9, inFlight=0, closed=true]: there are still pending requests
DEBUG [2016-12-17 22:57:57,403] cluster1-nio-worker-2 - com.datastax.driver.core.Connection Connection[cassandra-cyanite-ring.ec2.nearbuysystems.com/10.7.162.162:9042-9, inFlight=0, closed=true] Response received on stream 1216 but no handler set anymore (either the request has timed out or it was closed due to another error). Received message is SUPPORTED {COMPRESSION=[snappy, lz4], CQL_VERSION=[3.4.2]}

The odd thing is that it takes almost exactly 50 seconds for the SUPPORTED message (what gets sent back from the OPTIONS heartbeat) to be received. I'll keep looking at the logs but that seemed odd.

Thanks,
Andy

To unsubscribe from this group and stop receiving emails from it, send an email to java-driver-us...@lists.datastax.com.

--
Nate Mueller - Head of Engineering - RetailNext - 406-356-6283

Nate Mueller

unread,

Dec 28, 2016, 11:57:34 AM12/28/16

to java-dri...@lists.datastax.com

Thanks for taking a look. One of the challenges is that the database connection in question is the one we use to save telemetry from our production systems so I don't have a lot of visibility into the state of things starting ~15 seconds before the issue occurs.

Repeated network outages of 30+ seconds would be surprising. The machines are all in EC2 and we have other instances on the same subnet where short outages like that would be very visible. Load is a possibility. We're seeing the issue on a couple machines, but the one that's seeing it the most averages 20% idle. I'll give it more resources and see if that reduces the frequency of failures.

To unsubscribe from this group and stop receiving emails from it, send an email to java-driver-user+unsubscribe@lists.datastax.com.

Nate Mueller

unread,

Jan 25, 2017, 6:13:17 PM1/25/17

to java-dri...@lists.datastax.com

Update from one month later. I doubled the size of the two machines and this problem has not happened since.

Reply all

Reply to author

Forward