I'm trying to do some ETL on a large Cassandra table.
First I do:
data = session.execute(query='select ... from table', timeout=None)
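For context, here is a minimal sketch of roughly how that session is set up and the query run (the contact point and keyspace below are placeholders, not my real values):

from cassandra.cluster import Cluster

# Placeholder contact point and keyspace
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')

# For large results the driver pages through the rows as they are iterated
data = session.execute('select ... from table', timeout=None)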
Then I write the data to a csv:
row_count = 0
row_batch = []
# Open in text mode ('wt') so csv.writer gets strings rather than bytes
with gzip.open(file_name, 'wt', newline='') as csv_file:
    writer = csv.writer(csv_file)
    # Write to compressed csv in batches of 5000
    for row in data:
        row_count += 1
        row_batch.append(row)
        if row_count % 5000 == 0:
            writer.writerows(row_batch)
            logger.debug('wrote row {} to {}'.format(row_count, file_name))
            row_batch = []
    # Write remaining rows if any
    if row_batch:
        writer.writerows(row_batch)
Next, I read the csv, do some transformations, and then insert the results into another table on another cluster.
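Roughly, that step looks like this (the destination contact point, keyspace, table, columns, and the transform() helper are placeholders for illustration):

import csv
import gzip

from cassandra.cluster import Cluster

# Connect to the destination cluster (placeholder contact point / keyspace)
dest_cluster = Cluster(['10.0.0.1'])
dest_session = dest_cluster.connect('dest_keyspace')

# Prepared insert for the target table (placeholder table and columns)
insert_stmt = dest_session.prepare(
    'INSERT INTO dest_table (id, value) VALUES (?, ?)')

with gzip.open(file_name, 'rt', newline='') as csv_file:
    reader = csv.reader(csv_file)
    for row in reader:
        # transform() stands in for the real transformation; it returns
        # the bind values for the prepared insert
        dest_session.execute(insert_stmt, transform(row))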
The problem I keep seeing is that whenever my source table query is large (more than 2 billion rows), the line count of the csv file always comes out to exactly the maximum 32-bit signed integer (2147483647), so I assume I'm hitting some kind of session.execute limit?
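For reference, here is a simple way to check that count (just a straight read-back of the gzipped file):

import gzip

with gzip.open(file_name, 'rt') as csv_file:
    line_count = sum(1 for _ in csv_file)
# on the large exports this always comes out to 2147483647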
I'm using the Cassandra driver version 2.6.0.c1.
Any help or suggestions would be much appreciated.
Alternatively, if anyone has other suggestions on how to export more than 2 billion rows to a csv for processing, that would be helpful as well.