Tracing ReadTimeoutException with Consistency LOCAL_ONE 0 replica responded

Shinta Smith

Sep 22, 2016, 2:42:27 PM
to DataStax Java Driver for Apache Cassandra User Mailing List
We have a 56-node Cassandra 2.0.9 cluster with RF=3, and we use DataStax Java driver version 3.0.3. We submit the queries asynchronously and then wait for the results by calling future.getUninterruptibly().

Under stress, our apps started getting:

com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded)

The query is a range query:

SELECT * FROM mcf WHERE key = :key AND column1 >= :start AND column1 <= :end;

The schema for mcf is very simple:

CREATE TABLE IF NOT EXISTS mcf (
    key text,
    column1 bigint,
    value blob,
    PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE AND speculative_retry = 'NONE';


Running this query from cqlsh after we see the exception in our application logs always returns data very quickly, and the result is not large: only 20-40 rows.

Someone from DataStax suggested turning on query tracing in this thread:
https://groups.google.com/a/lists.datastax.com/d/msg/java-driver-user/zqHIjLV4kKI/E9f4IRubcZIJ

The problem is that the query trace is attached to the ResultSet, which we must obtain by calling future.getUninterruptibly(). But that is the very call that throws ReadTimeoutException, so we never get a ResultSet on which to call getExecutionInfo().

PreparedStatement ps = session.prepare("SELECT * FROM mcf WHERE key = :key AND column1 >= :start AND column1 <= :end");
ps.enableTracing();
ResultSetFuture future = session.executeAsync(ps.bind("mykey", startTs, endTs));
...

ResultSet rs = null;
try {
    rs = future.getUninterruptibly();
    List<Row> rows = rs.all();
    // process them
} catch (ReadTimeoutException rte) {
    // log queries getting timeout
    log.debug("timeout on queryId: " + rs.getExecutionInfo().getQueryTrace().getTraceId());
    //                                 ^^ but rs is null
} catch (Exception ex) {
    ....
}

My questions are:
* Is there a way to get at least the queryId before we call future.getUninterruptibly()? If we could just get the queryId, we could go to system_traces and get the events manually from cqlsh.
* Is there a better way to debug/troubleshoot this issue?

Thanks,
-shinta

Olivier Michallat

Sep 22, 2016, 6:26:39 PM
to java-dri...@lists.datastax.com
Hi,

Unfortunately the driver provides no way to retrieve the tracing id from a failed response. And actually Cassandra doesn't even return it: in theory the native protocol allows it (an ERROR response frame can contain a tracing id), but in practice it's not included, even though traces are generated server-side and stored in the system tables.

So your best bet right now would be to look up your failing query manually in the tracing tables (select * from system_traces.sessions). Admittedly, this can quickly become impractical if you trace a lot of queries.
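
To make that manual lookup less painful, one option is to record on the client when the query was submitted and when the timeout was caught, then keep only the trace sessions that started in that window and ran long. The following is a minimal sketch of that filtering step only. TraceSessionFilter, TraceSession and candidates are hypothetical names, not driver APIs. It assumes the rows have already been read from system_traces.sessions (session_id, started_at, duration) and that duration is expressed in microseconds. Whether a timed-out read gets a duration recorded at all is not something this thread confirms, so sessions with no duration are kept as candidates.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Hypothetical helper, not part of the DataStax driver.
final class TraceSessionFilter {

    // One row copied from system_traces.sessions (session_id, started_at, duration).
    static final class TraceSession {
        final UUID sessionId;
        final Instant startedAt;
        final Integer durationMicros; // null when no duration was recorded

        TraceSession(UUID sessionId, Instant startedAt, Integer durationMicros) {
            this.sessionId = sessionId;
            this.startedAt = startedAt;
            this.durationMicros = durationMicros;
        }
    }

    // Keeps sessions that started between (submittedAt - slack) and (failedAt + slack)
    // and that either have no duration or ran for at least minDurationMicros.
    // The slack allows for clock differences between client and coordinator.
    static List<TraceSession> candidates(List<TraceSession> sessions,
                                         Instant submittedAt,
                                         Instant failedAt,
                                         Duration slack,
                                         long minDurationMicros) {
        Instant from = submittedAt.minus(slack);
        Instant to = failedAt.plus(slack);
        List<TraceSession> result = new ArrayList<>();
        for (TraceSession s : sessions) {
            boolean inWindow = !s.startedAt.isBefore(from) && !s.startedAt.isAfter(to);
            boolean slowOrUnfinished =
                    s.durationMicros == null || s.durationMicros >= minDurationMicros;
            if (inWindow && slowOrUnfinished) {
                result.add(s);
            }
        }
        return result;
    }
}
```

With a 5s server-side read timeout, passing 5_000_000 as minDurationMicros would be a natural starting threshold. The session ids that remain can then be looked up in system_traces.events from cqlsh.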


--

Olivier Michallat

Driver & tools engineer, DataStax

Shinta Smith

Sep 22, 2016, 6:50:30 PM
to java-dri...@lists.datastax.com
Olivier,
Thanks for responding. This is what I was afraid of. :-( Yes, we could be tracing a lot of queries. Finding the ones that are timing out could be like looking for a needle in a haystack.

We use the default timeouts of 5s in cassandra.yaml. I am wary of increasing this. Do you think I should? How do I resolve these query timeouts?

Thanks,
-shinta

Olivier Michallat

Sep 26, 2016, 5:23:24 PM
to java-dri...@lists.datastax.com
Hi Shinta,

ReadTimeoutException is a server-side issue (it indicates a timeout between the coordinator and a replica). As a general guideline, you should look at the following on the server:
- GC activity
- the output of nodetool proxyhistograms and nodetool cfstats
- the size of your queries
- whether your reads are triggering tombstone warnings (TombstoneOverwhelmingException).

I would also suggest asking on the cassandra-user list, where you might find more people who've dealt with this kind of issue.

--

Olivier Michallat

Driver & tools engineer, DataStax

