Node marked as down in cluster, but Python driver continues to try to access it.


Andrew

Feb 9, 2015, 2:07:40 PM
to python-dr...@lists.datastax.com

Hello,

We are running into a problem while accessing our Cassandra cluster.  

Our cluster of nodes communicates over two different interfaces. For communication amongst themselves, the Cassandra nodes talk over a public interface, eth0. The Python driver communicates with the nodes over a private network, eth1.

In the event that a node loses connectivity over eth0, the other nodes in the cluster correctly mark it as down. However, that node still sees itself as up. When the driver attempts to communicate with the cluster over eth1, it can reach all the nodes, but the specific node whose eth0 interface is down is never able to achieve quorum for the requests. The result is an error on the driver side.

Our hope was that there is some sort of policy or parameter that would allow the driver to mark such a node as down if it continually fails to achieve quorum while the rest of the cluster views it as down. Is there a function or tuning parameter in the Python driver that might help alleviate this issue?

For reference, we've tested with driver versions 1.1.1 and 2.1.4, and have seen the same problem (failing requests from the driver after minutes of eth0 being down) with both.

Thank you in advance!

Andrew



Adam Holmberg

Feb 13, 2015, 3:53:05 PM
to python-dr...@lists.datastax.com
This is an interesting setup. There's nothing built specifically for that. I've given this a little thought, and the best I've conceived of *with what is there* would be one of the following:

Retry policy
----------------
This is a long shot, but if (a) your replication factor is close to the number of nodes in the cluster, (b) you're issuing requests at a higher consistency level (CL), and (c) your application can tolerate a lower CL, you can use DowngradingConsistencyRetryPolicy to make the driver automatically retry at a lower CL. Unfortunately it retries on the same node; I've opened a ticket to possibly change that.
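As a rough configuration sketch (not runnable without a live cluster; in driver 2.x the `default_retry_policy` constructor argument is the way to wire a cluster-wide retry policy, and the contact point below is a placeholder):

```python
from cassandra.cluster import Cluster
from cassandra.policies import DowngradingConsistencyRetryPolicy

# DowngradingConsistencyRetryPolicy retries a request at a lower CL
# when the originally requested CL cannot be met.
cluster = Cluster(
    contact_points=["10.0.0.1"],  # placeholder address
    default_retry_policy=DowngradingConsistencyRetryPolicy(),
)
session = cluster.connect()
```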

Custom load balancing
------------------------------
1.) Use a retry policy that always raises on_unavailable (the default does this)
2.) Wrap your load balancing policy with a custom implementation that allows your application to add and remove blacklisted hosts.
    Blacklisted hosts are either filtered from the child query plan altogether, or at least tried after all others.
    Blacklisted hosts will need to be timed-out or removed on some other signal to allow their use later.
3.) Execute statements asynchronously, keeping a reference to the ResponseFuture.
4.) On except Unavailable: get the future's _current_host, blacklist it in your load balancing policy, and retry the failed statement.
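A minimal, self-contained sketch of step 2. The class and method names here (BlacklistWrapperPolicy, blacklist, the TTL expiry) are my own invention; a real implementation would subclass cassandra.policies.LoadBalancingPolicy and also delegate populate(), distance(), on_up(), and on_down() to the child policy.

```python
import time

class BlacklistWrapperPolicy(object):
    """Wraps a child load balancing policy; blacklisted hosts are tried last."""

    def __init__(self, child_policy, blacklist_ttl=60.0):
        self._child = child_policy   # e.g. a RoundRobinPolicy in the real driver
        self._blacklist = {}         # host -> expiry timestamp
        self._ttl = blacklist_ttl

    def blacklist(self, host):
        """Temporarily exclude a host (e.g. after an Unavailable error)."""
        self._blacklist[host] = time.time() + self._ttl

    def _is_blacklisted(self, host):
        expiry = self._blacklist.get(host)
        if expiry is None:
            return False
        if time.time() >= expiry:    # TTL elapsed: allow the host again
            del self._blacklist[host]
            return False
        return True

    def make_query_plan(self, working_keyspace=None, query=None):
        # Yield non-blacklisted hosts first, then blacklisted ones as a last resort.
        deferred = []
        for host in self._child.make_query_plan(working_keyspace, query):
            if self._is_blacklisted(host):
                deferred.append(host)
            else:
                yield host
        for host in deferred:
            yield host

# Stub child policy standing in for the driver's real policies, for illustration.
class _StubChild(object):
    def __init__(self, hosts):
        self.hosts = hosts
    def make_query_plan(self, working_keyspace=None, query=None):
        return iter(self.hosts)

policy = BlacklistWrapperPolicy(_StubChild(["10.0.0.1", "10.0.0.2", "10.0.0.3"]))
policy.blacklist("10.0.0.2")
print(list(policy.make_query_plan()))  # ['10.0.0.1', '10.0.0.3', '10.0.0.2']
```

For steps 3 and 4, the application would catch Unavailable around execute_async().result(), call policy.blacklist(future._current_host), and re-execute the statement.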

Hopefully something like this will allow your client to route around this type of outage. This might be easier to customize if we expanded some of the other policy interfaces to be host-aware, but I can't offer that in the immediate future.

Regards,
Adam Holmberg


