I'm doing some debugging of DCAwarePolicy and when it (via DCQueryPlan) decides that there are no more hosts available.
More specifically, I am doing a test with a 2 DC cluster, 4 Cassandra nodes per DC, with all Cassandra nodes in the local DC down. We're using version 2.8.0. I have disabled token and latency awareness so the only thing involved should be DCAwarePolicy, which I have configured with used_hosts_per_remote_dc of 2 and allow_remote_dcs_for_local_cl of cass_false. We do a lot of CL=1 and CL=2 queries and I'm looking at how cpp-driver acts when we do these queries in this sort of situation (keyspaces are all NetworkTopologyStrategy with 3 replicas per DC).
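For reference, a minimal sketch of how this configuration looks through the cpp-driver C API (the contact point and local DC name "DC1" are illustrative, not our actual values):

```cpp
#include <cassandra.h>

int main() {
  CassCluster* cluster = cass_cluster_new();
  cass_cluster_set_contact_points(cluster, "10.0.0.1");

  /* used_hosts_per_remote_dc = 2, allow_remote_dcs_for_local_cl = cass_false,
     as described above. "DC1" is a placeholder local DC name. */
  cass_cluster_set_load_balance_dc_aware(cluster, "DC1", 2, cass_false);

  /* Token and latency awareness disabled, so only DCAwarePolicy applies. */
  cass_cluster_set_token_aware_routing(cluster, cass_false);
  cass_cluster_set_latency_aware_routing(cluster, cass_false);

  cass_cluster_free(cluster);
  return 0;
}
```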
Strangely (to me), this is very sensitive to used_hosts_per_remote_dc: if I set it to 1, I get a 100% failure rate *from cpp-driver* indicating no hosts are available (specifically, CASS_ERROR_LIB_NO_HOSTS_AVAILABLE from IOWorker::retry in io_worker.cpp); if I set it to 2, I get a ~60% failure rate; if I set it to 3, I get a 0% failure rate.
I don't really understand what's going on here, and I can't get enough out of the driver via logging to tell. The only explanation I have for the error behaviour I see is that the driver doesn't think it can use any of the hosts that it knows about in the remote DC, and that this becomes less likely as I increase used_hosts_per_remote_dc; again, all nodes in the remote DC are up.
I have gone in with gdb to poke around at the internal state of DCAwarePolicy and DCQueryPlan instances. DCAwarePolicy::per_remote_dc_live_hosts_ doesn't have all the hosts in the remote DC, but I've seen it exceed used_hosts_per_remote_dc as well; this seems to be an artifact of how Session interacts with the policy and how DCAwarePolicy::distance is implemented. All the Host instances I've seen are marked as UP.
Working in gdb can be misleading, though, since it can lead to timeouts and disconnects that trigger changes to these objects, and it is the behaviour described above that I'm trying to understand (specifically why I'm seeing errors at all, and why increasing used_hosts_per_remote_dc would make them go away...)
Next steps would be to rebuild our cpp-driver with additional logging to see better what is going on with the state of DCAwarePolicy, but I wanted to ask here first to see if anyone had some insight into this.
Cheers
Oliver
After adding more debugging I've tracked this down to this logic in IOWorker::retry:
    if (it != pools_.end() &&
        it->second->is_ready() &&
        it->second->write(request_execution)) {
      return; // Successfully written or pending
    }
Specifically, it turns out that it == pools_.end(), meaning that the DCAwareQueryPlan is returning a host that isn't in the PoolMap pools_. I couldn't tell you why at this point. So it winds up skipping all the hosts returned from DCAwareQueryPlan in these failure cases and returning CASS_ERROR_LIB_NO_HOSTS_AVAILABLE. My guess is that bumping up used_hosts_per_remote_dc high enough ensures that at least one of the hosts returned is present in pools_.
At this point we're considering just making used_hosts_per_remote_dc very large (larger than any reasonable cluster size)...
--
You received this message because you are subscribed to the Google Groups "DataStax C++ Driver for Apache Cassandra User Mailing List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cpp-driver-user+unsubscribe@lists.datastax.com.
~Fero