Client.execute not invoking the callback on a select CQL


Krassimir Fotev

Oct 8, 2014, 10:38:14 AM
to nodejs-dr...@lists.datastax.com
I am tracking down a strange behavior which, from a driver end-user's point of view, boils down to a callback not being invoked for hours.

The application is a crawler that uses Cassandra as persistent storage for URLs to process, stats, and retrieved content. On my laptop I am running a single Cassandra 2.1 node. The upgrade of node.js, the driver, and the Cassandra distribution has yet to be tested on a server.

I've put wrappers in place looking for runaway queries with no results reported back. I noticed that while the writes are low latency and always invoke their callbacks, the selects take longer and may not always result in a callback being invoked. With the invocation delay running into hours, I suspect something has gone wrong, either in the driver, the transport, or Cassandra itself.
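
For reference, the wrappers are roughly along these lines (a simplified sketch, not the exact code; the function name, the 30-second deadline, and the log message are illustrative):

// Simplified sketch of the watchdog wrapper around client.execute:
// warn when a callback has not fired within a deadline.
function executeWithWatchdog(client, query, params, options, callback) {
  var done = false;
  var timer = setTimeout(function () {
    if (!done) {
      console.warn('No callback after 30s for query: ' + query);
    }
  }, 30000);
  client.execute(query, params, options, function (err, result) {
    done = true;
    clearTimeout(timer);
    callback(err, result);
  });
}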

As the number of queries with no returns increases, we eventually start getting errors like "Operation timed out - received only 0 responses."

The rpc_server_type is hsha. There are multiple writes hitting Cassandra simultaneously. The number of reads has been reduced to a bare minimum using transaction support and IF conditions. The values being read have a TTL. There are DELETEs going against the table we are reading from.

While I will try to eliminate the DELETE and TTL and see if the SELECT problem remains, I have a feeling there is something deeper going on, hence this post. Let me know if you need my help reproducing the issue, or at least eliminating the driver as the cause of the observed behavior.

Thanks,
-Krassimir

Jorge Bay Gondra

Oct 8, 2014, 11:09:02 AM
to nodejs-dr...@lists.datastax.com
Hi Krassimir,
If you are getting a ResponseError with status "Timed out", the problem is most likely in your Cassandra cluster (type of schema / query / hardware).
You can configure the Cassandra read timeout in cassandra.yaml (default 10 secs); you could set it to a lower value and go from there (check the Cassandra logs, etc.).

Normally, you get read timeouts when Cassandra cannot keep up with the volume of SELECT queries. If you are running on modest hardware for the number of requests, this is fairly common. If that is what is happening, try reducing the pressure on the Cassandra node.

If the driver is deadlocking (not because it is queuing new queries, but because of a driver bug), we can look into it.

Jorge
--
DataStax

Krassimir Fotev

Oct 8, 2014, 12:32:18 PM
to nodejs-dr...@lists.datastax.com
Jorge: Thanks for the reply. My main concern is that I DO NOT see timeouts reliably.

It is more likely that I get nothing, i.e. no callback invocation, than a timeout. Would a verbose driver log help in any way?
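
If it would help, I can hook the driver's log events to capture a verbose trace - assuming the Client emits 'log' events, something roughly like:

// Assumption: the Client emits 'log' events (level, className, message).
// Illustrative hook to dump a verbose driver trace.
client.on('log', function (level, className, message) {
  console.log('[driver]', level, className, message);
});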

Attached, please find the timeout values from the cassandra.yaml, WARN and ERROR entries in the logs.

Rgds,
-Krassimir


---
read_request_timeout_in_ms: 5000
range_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 5000
counter_write_request_timeout_in_ms: 10000
cas_contention_timeout_in_ms: 3000
truncate_request_timeout_in_ms: 60000
request_timeout_in_ms: 30000

--
A quick look at the logs showed this error:

ERROR [CompactionExecutor:122] 2014-10-07 21:18:48,318 CassandraDaemon.java:166 - Exception in thread Thread[CompactionExecutor:122,1,main]
java.lang.NullPointerException: null
at org.apache.cassandra.service.CacheService$KeyCacheSerializer.serialize(CacheService.java:475) ~[apache-cassandra-2.1.0.jar:2.1.0]
at org.apache.cassandra.service.CacheService$KeyCacheSerializer.serialize(CacheService.java:463) ~[apache-cassandra-2.1.0.jar:2.1.0]
at org.apache.cassandra.cache.AutoSavingCache$Writer.saveCache(AutoSavingCache.java:225) ~[apache-cassandra-2.1.0.jar:2.1.0]
at org.apache.cassandra.db.compaction.CompactionManager$11.run(CompactionManager.java:1061) ~[apache-cassandra-2.1.0.jar:2.1.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_67]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_67]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_67]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_67]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_67]

and these warnings popping up frequently in the logs

WARN  [MigrationStage:1] 2014-10-07 18:32:29,350 LeveledCompactionStrategy.java:67 - Max sstable size of 5MB is configured for domain_stat.domain_stat.  Testing done for CASSANDRA-5727 indicates that performance improves up to 160MB

WARN  [SharedPool-Worker-17] 2014-10-07 17:47:18,064 SliceQueryFilter.java:236 - Read 2 live and 1274 tombstoned cells in crawler.url_queue_time (see tombstone_warn_threshold).

Krassimir Fotev

Oct 8, 2014, 4:35:42 PM
to nodejs-dr...@lists.datastax.com
Jorge: An exception thrown by the callback stack caused up to 128 requests to never return / invoke a callback.

It seems the callback exception disrupted whatever this particular connection (queue?) was doing.

It may be a good idea to catch callback exceptions, at the very least when some kind of debugging option is set, and warn the developer. Otherwise, if left undocumented and/or unnoticed, these exceptions will cause failures in different, "unrelated" branches of the code, and finding the root cause will be difficult with no hints.
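
To illustrate, something along these lines (a sketch; the query, the error, and the helper name are placeholders) is enough to trigger the behavior on my side, and the only application-side workaround I see is to wrap my own callbacks defensively:

// Sketch of the failure mode: a client callback that throws back into
// the driver. The query and the error are placeholders.
client.execute('SELECT * FROM crawler.url_queue_time LIMIT 1', function (err, result) {
  throw new Error('bug in application code'); // disrupts the connection
});

// Application-side workaround: never let your own callbacks throw.
function safe(callback) {
  return function (err, result) {
    try {
      callback(err, result);
    } catch (e) {
      console.error('callback threw:', e);
    }
  };
}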

Btw, I just made a pull request for a minor bug fix: running a query from within the client.connect callback causes

TypeError: Cannot read property 'length' of undefined
    at Connection.getInFlight (/Users/kfotev/Work/PierBelt/Wordpress/crawler-lite/node_modules/cassandra-driver/lib/connection.js:366:66)
    at /Users/kfotev/Work/PierBelt/Wordpress/crawler-lite/node_modules/cassandra-driver/lib/utils.js:178:24
    at Array.sort (native)
    at getLeastBusy (/Users/kfotev/Work/PierBelt/Wordpress/crawler-lite/node_modules/cassandra-driver/lib/host.js:116:24)
    at fn (/Users/kfotev/Work/PierBelt/Wordpress/crawler-lite/node_modules/cassandra-driver/node_modules/async/lib/async.js:641:34)
    at Object._onImmediate (/Users/kfotev/Work/PierBelt/Wordpress/crawler-lite/node_modules/cassandra-driver/node_modules/async/lib/async.js:557:34)
    at processImmediate [as _immediateCallback] (timers.js:345:15)
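
For reference, a minimal reproduction is roughly the following (the contact point and the query are placeholders):

// Minimal reproduction sketch: running a query from within the
// client.connect callback. Contact point and query are placeholders.
var cassandra = require('cassandra-driver');
var client = new cassandra.Client({ contactPoints: ['127.0.0.1'] });

client.connect(function (err) {
  if (err) return console.error(err);
  // Executing here triggers the TypeError shown above.
  client.execute('SELECT key FROM system.local', function (err, result) {
    console.log(err || result.rows);
  });
});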

Hopefully, you'll accept and merge.

Thanks,
-Krassimir

Jorge Bay Gondra

Oct 9, 2014, 4:42:19 AM
to nodejs-dr...@lists.datastax.com
Hi Krassimir,
On the driver side, when it is enqueuing requests (because there are too many requests without responses), it already issues a warning in the log.
On the Cassandra side, it seems that compaction (which you can tune) is causing the timeouts. You are using Leveled Compaction; I don't know if it's suitable for your use case (high read vs. write ratios, etc.).

Thanks,
Jorge

Krassimir Fotev

Oct 9, 2014, 9:47:03 AM
to nodejs-dr...@lists.datastax.com
Jorge: That is cool about the pressure warning! Beyond the Cassandra-issued timeouts, though, my concern is a damaged driver.

This is what I meant: the client code gives you callbacks, which the driver invokes when a particular request is done, has a data chunk fetched, etc. For some requests these callbacks were never invoked - no error, no result notification, nothing!

It turns out that if one of the client-provided callbacks throws, the entire driver stops functioning properly. This is not immediately apparent. Over time, requests stay within the driver forever: after an exception in a callback, some requests never make it to Cassandra, never time out, and never result in the client-provided callback being invoked.

You should probably wrap the client callback invocation in a try/catch, protecting the driver's functionality rather than leaving the driver at the mercy of client code you have no control over. That was my main point.
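
Purely as an illustration of the idea (the function name is made up; this is not the driver's actual internals):

// Illustration only: shielding the driver from a throwing client callback.
function invokeClientCallback(callback, err, result) {
  try {
    callback(err, result);
  } catch (e) {
    // Keep the connection's request handling alive and surface the problem.
    console.warn('Client-provided callback threw an exception:', e);
  }
}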

Thanks,
-Krassimir