ReadTimeout over large number of rows


Edward Tseng

Feb 20, 2015, 2:18:01 PM
to python-dr...@lists.datastax.com
Hi,

This may or may not be a problem with the Python Cassandra driver, but I am getting a read timeout when querying over 20k rows of data from Cassandra:

  File "/Users/etseng/conferenceViewer/conferenceViewer/models/stats.py", line 35, in get_map

    tweeted_words = session.execute(self.QUERY_TWEETS % timestamp)

  File "/Users/etseng/.virtualenvs/conferenceViewer/lib/python2.7/site-packages/cassandra/cluster.py", line 1295, in execute

    result = future.result(timeout)

  File "/Users/etseng/.virtualenvs/conferenceViewer/lib/python2.7/site-packages/cassandra/cluster.py", line 2799, in result

    raise self._final_exception

ReadTimeout: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}

Are there settings in the Python driver that I can use to increase the read timeout? I read through the docs but didn't find anything related to read timeouts. Thanks!

Regards,

Ed

Adam Holmberg

Feb 20, 2015, 2:37:31 PM
to python-dr...@lists.datastax.com
There are request timeouts on the driver side, but the error you're seeing is the server-side coordinator timing out. That is governed by settings in the cassandra.yaml file.
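If you want to turn the knobs on either side, here's a minimal sketch (the driver attributes are from the 2.x Python driver; the yaml keys are from a stock cassandra.yaml, so check your version):

    # Driver side: client request timeout, in seconds
    session.default_timeout = 30                 # session-wide
    rows = session.execute(query, timeout=60)    # or per request

    # Server side (cassandra.yaml, in milliseconds; this is what your error hit):
    #   read_request_timeout_in_ms: 5000
    #   range_request_timeout_in_ms: 10000       # for scans spanning partitions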

That said, I can't tell anything about your data model (or what your query is doing) from here, but it's usually not a good idea to make a single request that large: marshaling such a big result set can strain the coordinator node. If possible, it would be better to split the queries by partition and run them with cassandra.concurrent.execute_concurrent, something like the sketch below.
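Something along these lines, assuming a hypothetical hour-bucketed table (the table/column names and the process() helper are made up):

    from cassandra.concurrent import execute_concurrent

    prepared = session.prepare(
        "SELECT word FROM tweets WHERE hour_bucket = ?")
    buckets = [...]  # your list of partition key values
    statements_and_params = [(prepared, (b,)) for b in buckets]

    results = execute_concurrent(
        session, statements_and_params, concurrency=50,
        raise_on_first_error=False)
    for success, result in results:
        if success:
            for row in result:
                process(row)  # on failure, result holds the exception instead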

Adam Holmberg


Edward Tseng

Feb 20, 2015, 3:01:39 PM
to python-dr...@lists.datastax.com
Thank you for the quick response, Adam.

Let's suppose I am storing time series data. Right now, the data set is partitioned by the hour. It is not unusual for my use case to accumulate 20k rows of data in any given hour. Would you suggest breaking the partition down further, to the minute, in this case? I am new to Cassandra and I don't really have a good sense of what "large" is.
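For concreteness, the table is shaped roughly like this (column names simplified):

    session.execute("""
        CREATE TABLE tweets (
            hour_bucket timestamp,  -- partition key: timestamp truncated to the hour
            tweeted_at  timeuuid,   -- clustering column
            word        text,
            PRIMARY KEY (hour_bucket, tweeted_at)
        )
    """)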

I did some research online, and it seems like the optimal number of rows per partition is perhaps 5k? Is that correct?

Alex Popescu

Feb 20, 2015, 3:16:35 PM
to python-dr...@lists.datastax.com

On Fri, Feb 20, 2015 at 12:01 PM, Edward Tseng <edwar...@gmail.com> wrote:
> I am new to Cassandra and I don't really have a good sense of what "large" is.

Maybe this could help: http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling


--

[:>-a)

Alex Popescu
Sen. Product Manager @ DataStax
@al3xandru

Adam Holmberg

Feb 20, 2015, 3:18:59 PM
to python-dr...@lists.datastax.com
Rows per partition can be >> 5K (depending on data size). Maybe you noticed that the default page size for some drivers is 5000. If you're reading from a single partition, you shouldn't need to break it up; the driver should be taking advantage of automatic paging.

If your row size is very large, you might want to reduce the fetch size (there is a session-wide default, plus per-query settings).
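For example (both knobs exist in the Python driver; the table/column names below are made up):

    from cassandra.query import SimpleStatement

    session.default_fetch_size = 1000   # session-wide (5000 out of the box)

    stmt = SimpleStatement(
        "SELECT word FROM tweets WHERE hour_bucket = %s",
        fetch_size=500)                 # per-query override
    for row in session.execute(stmt, (bucket,)):
        process(row)                    # pages are fetched transparently as you iterate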

Also make sure you're using token-aware routing to avoid an extra hop.
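e.g., when you build the Cluster (the contact point and keyspace name here are placeholders):

    from cassandra.cluster import Cluster
    from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy

    cluster = Cluster(
        ['127.0.0.1'],
        load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy()))
    session = cluster.connect('my_keyspace')

    # Note: the driver can only route token-aware when it knows the routing
    # key, e.g. with prepared statements.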

Adam

Edward Tseng

Feb 20, 2015, 4:32:34 PM
to python-dr...@lists.datastax.com
Okay, by setting the default_fetch_size to a smaller number, the read timeout problem went away.
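For the record, the change was a single line (the value here is just illustrative):

    session.default_fetch_size = 1000   # down from the 5000 default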

Thanks for your help, Adam and Alex. I also found the data modeling doc very useful. I have read two books on Cassandra (they could be a bit dated), but the use cases in that blog post were the most helpful of all. Thank you!