Recently, one of the 32 nodes in a single data center in our ring crashed and had to be replaced, so it was offline for about 10 hours.
I expected the Cassandra ring to sustain a single node failure without impacting the applications on top, and that was our experience when we used the Hector client library.
However, we noticed that we constantly received exceptions during the hours the node was down. Most of the exceptions were PoolTimeoutException, with some ConnectionAbortException and TimeoutException. The ConnectionPoolMonitor interface does have methods to report that a host is down, so I assumed there must be a way to configure Astyanax not to try to connect to downed hosts.
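For context, this is roughly how we build our Astyanax context and plug in the connection pool monitor; the cluster, keyspace, pool, and seed names below are placeholders, not our exact production settings:

    import com.netflix.astyanax.AstyanaxContext;
    import com.netflix.astyanax.Keyspace;
    import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
    import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
    import com.netflix.astyanax.connectionpool.impl.ConnectionPoolType;
    import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
    import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
    import com.netflix.astyanax.thrift.ThriftFamilyFactory;

    // Monitor that counts pool events, including host-down / host-reactivated.
    CountingConnectionPoolMonitor monitor = new CountingConnectionPoolMonitor();

    AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
        .forCluster("OurCluster")                 // placeholder cluster name
        .forKeyspace("OurKeyspace")               // placeholder keyspace name
        .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
            .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE)
            .setConnectionPoolType(ConnectionPoolType.ROUND_ROBIN))
        .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("OurPool")
            .setPort(9160)
            .setMaxConnsPerHost(10)
            .setSeeds("seed1:9160,seed2:9160"))   // placeholder seeds
        .withConnectionPoolMonitor(monitor)
        .buildKeyspace(ThriftFamilyFactory.getInstance());

    context.start();
    Keyspace keyspace = context.getClient();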
We have done quite a bit of research on the internet but have not found a good answer. I then looked at the Astyanax source code a little (more specifically, RoundRobinExecuteWithFailover.java), and the code does not appear to exclude downed hosts. Does Astyanax rely on retries to move on to the next host, rather than recognizing a downed host and not trying it at all?
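For illustration, this is the kind of per-query retry policy I am referring to (the column family and row key are placeholders, not our real schema); as far as I can tell it controls how many attempts are made, but I am not sure whether it changes which hosts get tried, which is part of what I am asking:

    import com.netflix.astyanax.model.ColumnFamily;
    import com.netflix.astyanax.model.ColumnList;
    import com.netflix.astyanax.retry.ExponentialBackoff;
    import com.netflix.astyanax.serializers.StringSerializer;

    // Placeholder column family with String row keys and String column names.
    ColumnFamily<String, String> CF = new ColumnFamily<String, String>(
        "MyColumnFamily", StringSerializer.get(), StringSerializer.get());

    // 'keyspace' is the Keyspace built from the context shown above.
    // Explicit retry policy: up to 5 attempts, exponential backoff starting at 250 ms.
    ColumnList<String> result = keyspace.prepareQuery(CF)
        .withRetryPolicy(new ExponentialBackoff(250, 5))
        .getKey("someRowKey")
        .execute()
        .getResult();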
Does anyone here have suggestions or information that could help us?
Also, I realize Astyanax has had a lot of updates. We switched from Hector to Astyanax about a year ago, the version we are using is 1.56.43, and it is now on 2.*.
Since we are in production under very heavy load, I am hesitant to just try the new version without knowing whether it actually fixes the issue.
Thanks!
-- Benjamin Xu
Comcast