I've attached another log snippet for this issue.
The scenario: 2-node cluster with replication factor 2, both nodes fully repaired.
The test runs a loop on the first node every two seconds, issuing a "get-token" request that executes two prepared SELECT statements in parallel.
The 2nd node is rebooted.
When the 2nd node's Cassandra stops, a lookup request fails after a 12-second timeout; the retry also fails after a 12-second timeout.
Note that the "is down" notifications arrive 7 seconds after the first "Request timed out" response, yet the retried request still waits an additional 5 seconds before receiving its own "Request timed out" error.
I know I can adjust the timeout interval, but I'm not sure what a reasonable lower value might be. Any suggestions?
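For reference, this is the knob I mean, a minimal sketch assuming the driver's cluster-level request timeout setting (the 2000 ms value is only an illustration, not a recommendation):

```c
#include <cassandra.h>

/* Sketch: lower the per-request timeout from the default 12000 ms
 * at cluster-configuration time. This applies to all requests made
 * through sessions created from this cluster object. The 2000 ms
 * value below is a placeholder, not a tested recommendation. */
static CassCluster* make_cluster(void) {
  CassCluster* cluster = cass_cluster_new();
  /* contact points, auth, etc. configured as usual ... */
  cass_cluster_set_request_timeout(cluster, 2000); /* milliseconds */
  return cluster;
}
```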
Is there any possibility of treating the in-flight requests as timed-out immediately when the "is down" notification comes in?
Also, my app received notice that the .31 node was down 21 seconds before the "is down" notice was logged by the cpp-driver. Is there some way my app could advise the cpp-driver that it should consider a node down, even though the "is down" notice has not yet arrived from the Cassandra server?
Thanks,
Kevin Kingdon