Pavel
Mar 7, 2017, 9:27:00 AM
to DataStax C++ Driver for Apache Cassandra User Mailing List
Hello, I am observing behavior that seems strange to me; I have probably misunderstood something about speculative execution. I have speculative execution set up on a cluster where I execute idempotent read and write queries, but I am getting read and write timeouts sooner than I would expect, and without the retries I would expect to happen.
Now some more details:
- cluster has 3 nodes, cassandra.yaml configures “read_request_timeout_in_ms: 5000, write_request_timeout_in_ms: 2000”.
- replication factor is 3
- I am setting up the cluster with a 9 s timeout (cass_cluster_set_request_timeout(m_cluster, 9000) – I assume this is the total timeout for a query to be executed?)
- the cluster has speculative execution on, with up to 2 speculative executions at a constant delay of 3 s (cass_cluster_set_constant_speculative_execution_policy(m_cluster, 3000, 2))
- statements are set to idempotent (cass_statement_set_is_idempotent(m_statement, cass_true)); they are read or write statements with some bound parameters, and the write statements in question are prepared
- the read statements in question are using consistency level 1, the write statements consistency level local quorum
I have 3 different problems:
- Sometimes during a normal system run, I get a message like “Read timeout: Operation timed out - received only 0 responses.”. I get this message after a query execution time (cass_session_execute to cass_future_error_code duration) of 5.005 s. Why is that? Shouldn’t the query be retried by the speculative execution and time out only after 9 s, if the other executions also fail or do not respond within the time interval? Also, this is a read query with consistency level one; it is hard to believe that none of the 3 nodes that should hold a replica was able to respond at all, even without the retry.
- Similar problem with writes: the write queries use local quorum, and errors similar to “Write timeout: Operation timed out - received only 1 responses.” appear after 2.002 s. Again, why is nothing retried automatically?
- When I unplug one of the nodes, some of the clients report “Request timed out: Request timed out” after 9.002 s. It looks as if they were not able to contact the coordinator node. Again, shouldn’t the speculative execution kick in? Shouldn’t the speculative execution then attempt to contact a different coordinator?
Can somebody please explain to me what is happening in the three problems described above? Why am I getting errors back sooner than after 9 seconds, and why am I getting them at all when I have speculative execution? What is the relation between speculative executions and timeouts? Which timeout settings apply in which situations, and when are things retried? (Total query timeouts? Timeouts for contacting the coordinator? Timeouts for reading from or writing to nodes? When are things retried for idempotent queries?)
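To make the numbers in the three cases concrete, here is a minimal sketch of the timeline arithmetic. It is a model, not driver code. It assumes that execution i starts at i times the constant delay and that the first response of any kind ends the request, including a server-side timeout error. It also assumes the client-side request timeout covers the whole request. Whether the driver really behaves this way is exactly what is being asked. The names `Timeline`, `first_completion` and `kNoResponse` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Marks an execution whose coordinator never answers (e.g. an unplugged node).
const std::int64_t kNoResponse = -1;

struct Timeline {
    std::int64_t finished_at_ms;  // when the client future completes
    bool client_timeout;          // true if only the client-side timeout fired
};

// response_after_ms[i] is how long execution i takes to get ANY response
// (a result or a server-side error), measured from the start of that
// execution, or kNoResponse if it never answers.
// ASSUMPTION: the first response of any kind completes the request.
inline Timeline first_completion(std::int64_t request_timeout_ms,
                                 std::int64_t delay_ms,
                                 int max_speculative,
                                 const std::vector<std::int64_t>& response_after_ms) {
    assert(request_timeout_ms >= 0 && delay_ms >= 0 && max_speculative >= 0);
    Timeline t{request_timeout_ms, true};
    const int executions = 1 + max_speculative;  // initial + speculative ones
    for (int i = 0;
         i < executions && i < static_cast<int>(response_after_ms.size());
         ++i) {
        const std::int64_t start = i * delay_ms;
        if (start >= t.finished_at_ms) break;  // request is over before this one starts
        if (response_after_ms[i] == kNoResponse) continue;
        const std::int64_t end = start + response_after_ms[i];
        if (end < t.finished_at_ms) {
            t.finished_at_ms = end;
            t.client_timeout = false;
        }
    }
    return t;
}
```

Under these assumptions, `first_completion(9000, 3000, 2, {5000, 5000, 5000})` finishes at 5000 ms: the first execution's server-side read timeout arrives before the later executions can answer, which is in line with the observed 5.005 s. With 2000 ms write timeouts the request finishes at 2000 ms, before any speculative execution starts. If no execution ever answers, only the 9000 ms client timeout fires. If only the first coordinator is silent and the second answers in 10 ms, the model gives 3010 ms, which is what one would expect in the unplugged-node case rather than the observed 9.002 s.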
Kind regards,
Pavel Cernohorsky