Details of my setup:
I have a two-node Cassandra cluster with the replication factor set to 2 for my keyspace. Sure, that's not the best use of clustering, but this is just a test. The cluster runs on CentOS 6.7, with Cassandra installed from the apache-cassandra-2.2.1-bin.tar.gz tarball.
My client application runs on CentOS 5.8; I installed the latest driver available for the CentOS 5 distribution, found at http://downloads.datastax.com/cpp-driver/centos/5/:
[root@jklvm cassandra]# rpm -qa | grep cassandra
cassandra-cpp-driver-devel-2.1.0-1.el5.centos.amd64
cassandra-cpp-driver-2.1.0-1.el5.centos.amd64
On the client, I lowered the request timeout to 100 ms. For my intended application, fast failure (and retry) is better than slow success.
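For reference, the timeout is applied at the cluster level; a minimal sketch (the helper name is just for illustration):

#include <cassandra.h>

/* Apply the aggressive client-side timeout described above.
 * cass_cluster_set_request_timeout() takes the value in milliseconds. */
static void configure_fast_failure(CassCluster* cluster) {
  cass_cluster_set_request_timeout(cluster, 100);
}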
After disconnecting the Ethernet cable from node_2, I was expecting a call to the policy's on_read_timeout(), but I got this instead:
1443454767.956 [INFO] (src/connection.cpp:696:static void cass::Connection::on_timeout(cass::Timer*)): Request timed out to host 172.18.48.184
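For completeness, this is roughly how a retry policy gets attached at the cluster level (a sketch using the driver's built-in default policy wrapped in the logging policy, so retry decisions show up in the driver log). As far as I can tell, on_read_timeout() is driven by a read_timeout error reported by the coordinator, whereas the message above is the driver's own client-side timeout, so the policy may simply not be consulted here.

#include <cassandra.h>

static void attach_retry_policy(CassCluster* cluster) {
  /* Built-in default policy wrapped in the logging policy so that
   * retry decisions are written to the driver log. */
  CassRetryPolicy* default_policy = cass_retry_policy_default_new();
  CassRetryPolicy* logging_policy = cass_retry_policy_logging_new(default_policy);

  cass_cluster_set_retry_policy(cluster, logging_policy);

  /* The cluster holds its own reference, so release the local ones. */
  cass_retry_policy_free(default_policy);
  cass_retry_policy_free(logging_policy);
}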
After reconnecting the Ethernet cable to node_2, I get this, indicating that node_1 is down even though requests are still being processed there successfully:
1443454787.220 [INFO] (src/control_connection.cpp:289:virtual void cass::ControlConnection::on_event(cass::EventResponse*)): Node 172.18.48.215 is down
1443454787.220 [DEBUG] (src/io_worker.cpp:272:virtual void cass::IOWorker::on_event(const cass::IOWorkerEvent&)): REMOVE_POOL event for 172.18.48.215 closing pool(0xe2c8100) io_worker(0xe2912a0)
1443454787.220 [DEBUG] (src/pool.cpp:93:void cass::Pool::close(bool)): Closing pool(0xe2c8100)
1443454787.220 [DEBUG] (src/connection.cpp:545:static void cass::Connection::on_close(uv_handle_t*)): Connection to host 172.18.48.215 closed
1443454787.220 [INFO] (src/io_worker.cpp:203:void cass::IOWorker::notify_pool_closed(cass::Pool*)): Pool for host 172.18.48.215 closed: pool(0xe2c8100) io_worker(0xe2912a0)
1443454787.220 [DEBUG] (src/pool.cpp:55:virtual cass::Pool::~Pool()): Pool dtor with 0 pending requests pool(0xe2c8100)
1443454787.221 [DEBUG] (src/io_worker.cpp:325:void cass::IOWorker::schedule_reconnect(const cass::Address&)): Scheduling reconnect for host 172.18.48.215 io_worker(0xe2912a0)
1443454789.226 [DEBUG] (src/pool.cpp:70:void cass::Pool::connect()): Connect 172.18.48.215 pool(0xe2c8100)
1443454789.226 [INFO] (src/pool.cpp:254:void cass::Pool::spawn_connection()): Spawning new connection to host 172.18.48.215:9042
1443454789.228 [DEBUG] (src/connection.cpp:509:static void cass::Connection::on_connect(cass::Connector*)): Connected to host 172.18.48.215
1443454789.237 [DEBUG] (src/control_connection.cpp:485:void cass::ControlConnection::refresh_node_info(cass::SharedRefPtr<cass::Host>, bool, bool)): refresh_node_info: SELECT peer, data_center, rack, rpc_address FROM system.peers WHERE peer = '172.18.48.215'
1443454789.238 [DEBUG] (src/io_worker.cpp:144:void cass::IOWorker::add_pool(const cass::Address&, bool)): Host 172.18.48.215 already present attempting to initiate immediate connection
1443454790.546 [INFO] (src/connection.cpp:696:static void cass::Connection::on_timeout(cass::Timer*)): Request timed out to host 172.18.48.184
1443454790.546 [WARN] (src/connection.cpp:167:virtual void cass::Connection::HeartbeatHandler::on_timeout()): Heartbeat request timed out on host 172.18.48.184
1443454790.603 [INFO] (src/control_connection.cpp:283:virtual void cass::ControlConnection::on_event(cass::EventResponse*)): Node 172.18.48.215 is up
To clarify, I open a connection to two nodes as my contact points. Both nodes have 100% of the data. I perform all my SELECT statements in a loop on this same connection. While the client is running this loop, I disconnect the Ethernet cable (the node is actually a VirtualBox VM, for which I uncheck "Cable Connected").
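Stripped down, the test client looks roughly like this (a sketch: the keyspace/table in the SELECT are placeholders, the IPs are the two nodes from the log above, and error handling is minimal):

#include <stdio.h>
#include <cassandra.h>

int main(void) {
  /* Two contact points (the two nodes) and the 100 ms request timeout. */
  CassCluster* cluster = cass_cluster_new();
  cass_cluster_set_contact_points(cluster, "172.18.48.215,172.18.48.184");
  cass_cluster_set_request_timeout(cluster, 100);

  CassSession* session = cass_session_new();
  CassFuture* connect_future = cass_session_connect(session, cluster);
  if (cass_future_error_code(connect_future) != CASS_OK) {
    fprintf(stderr, "connect failed\n");
    cass_future_free(connect_future);
    return 1;
  }
  cass_future_free(connect_future);

  /* The same SELECT over and over on the same session while the cable
   * is (virtually) pulled on one node. */
  for (;;) {
    CassStatement* statement =
        cass_statement_new("SELECT * FROM test_ks.test_table LIMIT 1", 0);
    CassFuture* result_future = cass_session_execute(session, statement);
    CassError rc = cass_future_error_code(result_future);
    if (rc != CASS_OK) {
      fprintf(stderr, "query failed: %s\n", cass_error_desc(rc));
    }
    cass_future_free(result_future);
    cass_statement_free(statement);
  }
}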
This is a great idea and it would be useful later for supporting SpeculativeRetries too.
--
Bests,
Alex Popescu | @al3xandru
Sen. Product Manager @ DataStax
> I will have to retry a number of times equal to the number of contact points I added in my CassCluster. Suppose I had 10 contact points and 5 were inaccessible due to a partial network outage.
>
OK, so this turned out not to be true, because the connect leaves out unreachable contact points. So at most one retry should be required.
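If that holds, the retry budget on the client can be capped at a single extra attempt; a sketch (the helper name is just for illustration):

#include <cassandra.h>

/* Execute a query, retrying at most once on a client-side timeout. */
static CassError execute_with_one_retry(CassSession* session, const char* query) {
  CassError rc = CASS_OK;
  int attempt;
  for (attempt = 0; attempt < 2; ++attempt) {
    CassStatement* statement = cass_statement_new(query, 0);
    CassFuture* future = cass_session_execute(session, statement);
    rc = cass_future_error_code(future);
    cass_future_free(future);
    cass_statement_free(statement);
    if (rc != CASS_ERROR_LIB_REQUEST_TIMED_OUT) break; /* success or other error */
  }
  return rc;
}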