Details of my setup:
I have a two-node Cassandra cluster with the replication factor set to 2 for my keyspace. Sure, that's not the best use of clustering, but this is just a test. The cluster runs on CentOS 6.7, with Cassandra installed from the apache-cassandra-2.2.1-bin.tar.gz tarball.
My client application runs on CentOS 5.8; I installed the latest driver available for the CentOS 5 distribution, found at http://downloads.datastax.com/cpp-driver/centos/5/:
[root@jklvm cassandra]# rpm -qa | grep cassandra
cassandra-cpp-driver-devel-2.1.0-1.el5.centos.amd64
cassandra-cpp-driver-2.1.0-1.el5.centos.amd64
On the client, I lowered the request timeout to 100 ms. For my intended application, fast failure (and retry) is better than slow success.
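For reference, the timeout is applied at the cluster level; a minimal sketch (the helper name is just for illustration):

#include <cassandra.h>

/* Apply the aggressive client-side timeout described above.
 * cass_cluster_set_request_timeout() takes the value in milliseconds. */
static void configure_fast_failure(CassCluster* cluster) {
  cass_cluster_set_request_timeout(cluster, 100);
}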
After disconnecting the Ethernet cable from node_2, I was expecting a call to the policy's on_read_timeout(), but I got this instead:
1443454767.956 [INFO] (src/connection.cpp:696:static void cass::Connection::on_timeout(cass::Timer*)): Request timed out to host 172.18.48.184
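For completeness, this is roughly how a retry policy gets attached at the cluster level (a sketch using the driver's built-in default policy wrapped in the logging policy, so retry decisions show up in the driver log). As far as I can tell, on_read_timeout() is driven by a read_timeout error reported by the coordinator, whereas the message above is the driver's own client-side timeout, so the policy may simply not be consulted here.

#include <cassandra.h>

static void attach_retry_policy(CassCluster* cluster) {
  /* Built-in default policy wrapped in the logging policy so that
   * retry decisions are written to the driver log. */
  CassRetryPolicy* default_policy = cass_retry_policy_default_new();
  CassRetryPolicy* logging_policy = cass_retry_policy_logging_new(default_policy);

  cass_cluster_set_retry_policy(cluster, logging_policy);

  /* The cluster holds its own reference, so release the local ones. */
  cass_retry_policy_free(default_policy);
  cass_retry_policy_free(logging_policy);
}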
After reconnecting the Ethernet cable to node_2, I get this, indicating that node_1 is down even though requests are still being processed there successfully:
1443454787.220 [INFO] (src/control_connection.cpp:289:virtual void cass::ControlConnection::on_event(cass::EventResponse*)): Node 172.18.48.215 is down
1443454787.220 [DEBUG] (src/io_worker.cpp:272:virtual void cass::IOWorker::on_event(const cass::IOWorkerEvent&)): REMOVE_POOL event for 172.18.48.215 closing pool(0xe2c8100) io_worker(0xe2912a0)
1443454787.220 [DEBUG] (src/pool.cpp:93:void cass::Pool::close(bool)): Closing pool(0xe2c8100)
1443454787.220 [DEBUG] (src/connection.cpp:545:static void cass::Connection::on_close(uv_handle_t*)): Connection to host 172.18.48.215 closed
1443454787.220 [INFO] (src/io_worker.cpp:203:void cass::IOWorker::notify_pool_closed(cass::Pool*)): Pool for host 172.18.48.215 closed: pool(0xe2c8100) io_worker(0xe2912a0)
1443454787.220 [DEBUG] (src/pool.cpp:55:virtual cass::Pool::~Pool()): Pool dtor with 0 pending requests pool(0xe2c8100)
1443454787.221 [DEBUG] (src/io_worker.cpp:325:void cass::IOWorker::schedule_reconnect(const cass::Address&)): Scheduling reconnect for host 172.18.48.215 io_worker(0xe2912a0)
1443454789.226 [DEBUG] (src/pool.cpp:70:void cass::Pool::connect()): Connect 172.18.48.215 pool(0xe2c8100)
1443454789.226 [INFO] (src/pool.cpp:254:void cass::Pool::spawn_connection()): Spawning new connection to host 172.18.48.215:9042
1443454789.228 [DEBUG] (src/connection.cpp:509:static void cass::Connection::on_connect(cass::Connector*)): Connected to host 172.18.48.215
1443454789.237 [DEBUG] (src/control_connection.cpp:485:void cass::ControlConnection::refresh_node_info(cass::SharedRefPtr<cass::Host>, bool, bool)): refresh_node_info: SELECT peer, data_center, rack, rpc_address FROM system.peers WHERE peer = '172.18.48.215'
1443454789.238 [DEBUG] (src/io_worker.cpp:144:void cass::IOWorker::add_pool(const cass::Address&, bool)): Host 172.18.48.215 already present attempting to initiate immediate connection
1443454790.546 [INFO] (src/connection.cpp:696:static void cass::Connection::on_timeout(cass::Timer*)): Request timed out to host 172.18.48.184
1443454790.546 [WARN] (src/connection.cpp:167:virtual void cass::Connection::HeartbeatHandler::on_timeout()): Heartbeat request timed out on host 172.18.48.184
1443454790.603 [INFO] (src/control_connection.cpp:283:virtual void cass::ControlConnection::on_event(cass::EventResponse*)): Node 172.18.48.215 is up
To clarify, I open a connection to two nodes as my contact points. Both nodes have 100% of the data. I perform all my SELECT statements in a loop on this same connection. While the client is running this loop, I disconnect the Ethernet cable (the node is actually a VirtualBox VM, for which I uncheck "Cable Connected").
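Stripped down, the test client looks roughly like this (a sketch: the keyspace/table in the SELECT are placeholders, the IPs are the two nodes from the log above, and error handling is minimal):

#include <stdio.h>
#include <cassandra.h>

int main(void) {
  /* Two contact points (the two nodes) and the 100 ms request timeout. */
  CassCluster* cluster = cass_cluster_new();
  cass_cluster_set_contact_points(cluster, "172.18.48.215,172.18.48.184");
  cass_cluster_set_request_timeout(cluster, 100);

  CassSession* session = cass_session_new();
  CassFuture* connect_future = cass_session_connect(session, cluster);
  if (cass_future_error_code(connect_future) != CASS_OK) {
    fprintf(stderr, "connect failed\n");
    cass_future_free(connect_future);
    return 1;
  }
  cass_future_free(connect_future);

  /* The same SELECT over and over on the same session while the cable
   * is (virtually) pulled on one node. */
  for (;;) {
    CassStatement* statement =
        cass_statement_new("SELECT * FROM test_ks.test_table LIMIT 1", 0);
    CassFuture* result_future = cass_session_execute(session, statement);
    CassError rc = cass_future_error_code(result_future);
    if (rc != CASS_OK) {
      fprintf(stderr, "query failed: %s\n", cass_error_desc(rc));
    }
    cass_future_free(result_future);
    cass_statement_free(statement);
  }
}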
This is a great idea and it would be useful later for supporting SpeculativeRetries too.
--
Bests,
Alex Popescu | @al3xandru
Sen. Product Manager @ DataStax
> I will have to retry a number of times equal to the number of contact points I added in my CassCluster. Suppose I had 10 contact points and 5 were inaccessible due to a partial network outage.
>
OK, so this turned out not to be true, because the connect leaves out unreachable contact points. So at most one retry should be required.
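If that holds, the retry budget on the client can be capped at a single extra attempt; a sketch (the helper name is just for illustration):

#include <cassandra.h>

/* Execute a query, retrying at most once on a client-side timeout. */
static CassError execute_with_one_retry(CassSession* session, const char* query) {
  CassError rc = CASS_OK;
  int attempt;
  for (attempt = 0; attempt < 2; ++attempt) {
    CassStatement* statement = cass_statement_new(query, 0);
    CassFuture* future = cass_session_execute(session, statement);
    rc = cass_future_error_code(future);
    cass_future_free(future);
    cass_statement_free(statement);
    if (rc != CASS_ERROR_LIB_REQUEST_TIMED_OUT) break; /* success or other error */
  }
  return rc;
}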