cpp-driver "request timed out" errors during node warmup


Kevin Kingdon

Dec 3, 2014, 1:57:23 PM12/3/14
to cpp-dri...@lists.datastax.com
Hi,
We are seeing "request timed out" errors being returned from the beta-5 cpp-driver during node warmup. The test case is a node being rebooted in a multi-node cluster. The keyspace resilience factor allows for the node to be rebooted, e.g., the resilience factor is 3 in a 4-node cluster, and one node is rebooted. Requests are using a consistency level of "ONE". It appears that the cpp-driver is notified about a node's availability early. The cpp-driver attempts to connect and send requests to the node. The requests fail, and the driver gives up without retrying the request on a different node. If our application retries the request, we see the same results. Things get better when the node has warmed up, but our application has suffered a partial service outage for 30-45 seconds despite the resilience factor allowing for the node reboot. This could translate to thousands of failed requests and seriously impede our ability to deliver real-time, "always available" service.
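For reference, a minimal sketch of how a statement is run at consistency ONE with the driver's C API (the query string, table name, and helper function here are illustrative placeholders, not our real test code):

```c
#include <cassandra.h>

/* Illustrative only: execute one statement at consistency ONE.
 * The query text and function name are placeholders. */
void run_at_one(CassSession* session) {
  CassStatement* statement = cass_statement_new("SELECT key FROM ks.tbl", 0);
  cass_statement_set_consistency(statement, CASS_CONSISTENCY_ONE);
  CassFuture* future = cass_session_execute(session, statement);
  cass_future_wait(future);
  cass_future_free(future);
  cass_statement_free(statement);
}
```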

Could this be a consequence of the default token-aware routing of requests?
Thanks,
Kevin Kingdon

Michael Penick

Dec 3, 2014, 3:55:54 PM12/3/14
to cpp-dri...@lists.datastax.com
Once a request is written to a connection it cannot be retried, because not all operations in Cassandra are idempotent (e.g. counters). The driver doesn't parse queries, so it's unable to automatically determine which requests are safe to retry. In these cases, it's up to the application to retry the query if it deems the operation safe. The driver returns CASS_ERROR_LIB_REQUEST_TIMED_OUT for requests that have been written to a connection when that connection closes. This is done because the response is never going to arrive, so the request would eventually have timed out anyway.

The driver is usually able to detect the connection close and notify the written requests fairly quickly, without waiting the full request timeout duration, which allows the application to retry quickly. Even if Cassandra notifies the driver that it's going down, there's still a window where some requests can be written to that node before the driver gets the signal, and Cassandra is not going to wait around to service all those requests. It should be noted that in all other cases (other than the request already having been written) the driver retries the request on other connections. I'm open to suggestions on how to best handle the issue.
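The division of responsibility described above can be sketched in plain C (a hypothetical simulation, not the driver's actual API): the driver surfaces a timeout code, and only the application, which knows whether the operation is idempotent, decides to retry.

```c
#include <stddef.h>

/* Hypothetical sketch: why a retry policy must know idempotency.
 * The driver cannot infer it from the query text. */

typedef enum { RC_OK, RC_TIMED_OUT } ResultCode;

/* Stand-in for a node that drops the first few written requests,
 * as happens when a connection closes mid-flight. */
typedef struct {
    int remaining_failures;
} FakeNode;

ResultCode execute_once(FakeNode* node) {
    if (node->remaining_failures > 0) {
        node->remaining_failures--;
        return RC_TIMED_OUT;  /* e.g. CASS_ERROR_LIB_REQUEST_TIMED_OUT */
    }
    return RC_OK;
}

/* Retry only when the application has marked the request idempotent;
 * a timed-out counter update, for example, must not be replayed. */
ResultCode execute_with_retry(FakeNode* node, int idempotent,
                              int max_attempts) {
    ResultCode rc = RC_TIMED_OUT;
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        rc = execute_once(node);
        if (rc == RC_OK || !idempotent)
            break;  /* success, or unsafe to retry */
    }
    return rc;
}
```

An idempotent select survives a couple of dropped writes, while a non-idempotent operation surfaces the timeout to the caller after the first attempt.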

The timeout issue was always possible, but I'm not sure why it's more pronounced with beta5/token-awareness in your environment. I'll do some digging.

Mike


Kevin Kingdon

Dec 3, 2014, 5:13:08 PM12/3/14
to cpp-dri...@lists.datastax.com
From the log files, it appears that the application-retried requests are repeatedly sent to the same node that is warming up. There doesn't appear to be any attempt to avoid the node that is not yet in a working state.

Kevin Kingdon

Dec 3, 2014, 5:41:39 PM12/3/14
to cpp-dri...@lists.datastax.com
I've attached the cpp-driver log lines (prefixed by our application). You can see a lot of thrashing during the node reboot times.
db.log

Michael Penick

Dec 3, 2014, 5:54:53 PM12/3/14
to cpp-dri...@lists.datastax.com
This is a known Cassandra issue: https://issues.apache.org/jira/browse/CASSANDRA-8236. Cassandra sends the node "up" and "added" events before it's actually ready to take traffic. That said, requests should not attempt to use those pools until they're connected and ready (https://github.com/datastax/cpp-driver/blob/1.0/src/io_worker.cpp#L157), even though those nodes are in the query plan.

What specific log entries are you seeing?

Mike

Michael Penick

Dec 3, 2014, 6:04:34 PM12/3/14
to cpp-dri...@lists.datastax.com
Ah, thanks. I didn't see this come in before my last email. This means the driver is attempting to connect, but requests won't use those connection pools until they actually have connections that are ready to take requests.

You can reduce some of that connection-reattempt thrashing by increasing cass_cluster_set_reconnect_wait_time() (the default is 2000 ms), but until CASSANDRA-8236 sends the events at the appropriate time there will be some amount of thrashing. Eventually, we'll have more configurable reconnection policies, including an exponential reconnection policy (à la https://github.com/datastax/java-driver/blob/2.1/driver-core/src/main/java/com/datastax/driver/core/policies/ExponentialReconnectionPolicy.java)
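The exponential schedule mentioned above could look roughly like this (a sketch modeled on the Java driver's ExponentialReconnectionPolicy; this is not in the cpp-driver today, which only has the fixed reconnect wait time). The delay doubles per failed attempt and is capped at a maximum:

```c
#include <stdint.h>

/* Sketch of an exponential reconnection schedule: the delay before
 * reconnect attempt `attempt` doubles each time, starting at
 * base_delay_ms and clamped to max_delay_ms. */
uint64_t reconnect_delay_ms(uint64_t base_delay_ms,
                            uint64_t max_delay_ms,
                            unsigned attempt) {
    uint64_t delay = base_delay_ms;
    for (unsigned i = 0; i < attempt; ++i) {
        if (delay >= max_delay_ms / 2)
            return max_delay_ms;  /* clamp; also avoids overflow */
        delay *= 2;
    }
    return delay;
}
```

With a 2000 ms base, the schedule runs 2000, 4000, 8000, ... up to the cap, so a node that stays down quickly stops being hammered with connection attempts.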

Mike

Kevin Kingdon

Dec 4, 2014, 8:04:18 PM12/4/14
to cpp-dri...@lists.datastax.com
I've attached another log snippet for this issue.
The scenario: 2-node cluster with replication factor 2, both nodes fully repaired.
The test runs a loop on the first node every two seconds, issuing a "get-token" request that executes two prepared select statements in parallel.
The 2nd node is rebooted.
When the 2nd node's Cassandra stops, a lookup request fails after a 12-second timeout, is retried and fails again after a 12-second timeout.

Note that the "is down" notifications come 7 seconds after the first "Request timed out" response, but the retried request waits an additional 5 seconds before receiving a "request timed out" error.

I know I can adjust the timeout interval, but I'm not sure what a reasonable lower value might be. Any suggestions?
Is there any possibility of treating the in-flight requests as timed-out immediately when the "is down" notification comes in?
Also, my app received a notice that the .31 node was down 21 seconds before the "is down" notice was logged by the cpp-driver. Is there some way my app could advise the cpp-driver that it should consider a node to be down, even though the "is down" notice has not yet arrived from Cassandra server?
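For reference, the knob I'd be adjusting is the cluster-wide request timeout; a minimal sketch (the 5000 ms value is only illustrative, and the function name here is made up):

```c
#include <cassandra.h>

/* Illustrative config fragment: lower the request timeout from the
 * 12000 ms we observe to something shorter. */
void configure_cluster(CassCluster* cluster) {
  cass_cluster_set_request_timeout(cluster, 5000); /* milliseconds */
}
```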

Thanks,
Kevin Kingdon
test.log

Michael Penick

Dec 22, 2014, 11:50:56 AM12/22/14
to cpp-dri...@lists.datastax.com
Sorry for the delayed response, but I hope my news will make up for it a little.

I was able to get some time last week to test your scenario, and I reproduced your result: the first request would time out after 12 seconds, then the retry would hit another full 12-second timeout even though the "DOWN" event had already arrived. The RC1 release (out today) includes a fix that immediately disconnects connections to hosts that receive a "DOWN" event (the pre-beta5 behavior). This does make the driver more susceptible to transient partitions in the Cassandra cluster causing unnecessary reconnects, but that problem is less likely (though still possible).

Mike