Is there a limit to the number of rows returned by session.execute?

Alex Baeza

Jun 3, 2016, 11:07:15 PM
to DataStax Python Driver for Apache Cassandra User Mailing List
I'm trying to do some ETL on a table.

First I do:

data = session.execute(query='select ... from table', timeout=None)

Then I write the data to a csv:

import csv
import gzip

row_count = 0
row_batch = []

# On Python 3, open with mode 'wt' and newline='' so csv.writer receives text
with gzip.open(file_name, 'w') as csv_file:
    writer = csv.writer(csv_file)

    # Write to compressed csv in batches of 5000
    for row in data:
        row_count += 1
        row_batch.append(row)
        if row_count % 5000 == 0:
            writer.writerows(row_batch)
            logger.debug('wrote row {} to {}'.format(row_count, file_name))
            row_batch = []

    # Write remaining rows, if any
    if row_batch:
        writer.writerows(row_batch)

Next, I read the csv, do some transformations, and then insert the results into another table on another cluster.

The problem I keep seeing is that whenever my source table query is large (more than 2 billion rows), the line count of the csv file is always exactly the maximum 32-bit signed integer value (2147483647), so I assume this is some type of session.execute limit?

I'm using the Cassandra driver 2.6.0.c1.

Any help or suggestions would be much appreciated.

Alternatively, if anyone has other suggestions on how to export more than 2 billion rows to a csv for processing, that would be helpful as well.

Adam Holmberg

Jun 6, 2016, 10:14:03 AM
to python-dr...@lists.datastax.com
I am not aware of any explicit limit in the Python driver. I'm trying it out here.

What version of Cassandra are you running?

Adam Holmberg

Jun 6, 2016, 12:02:01 PM
to python-dr...@lists.datastax.com
Also, can you share the table structure?

Alex Baeza

Jun 6, 2016, 1:06:00 PM
to python-dr...@lists.datastax.com
I am using DataStax Enterprise Server 4.7.3 (Cassandra 2.1.8.689).

Two separate tables exhibit the same behavior:

CREATE TABLE xxxxxxxxxxxxxxxxxx (
  aaaa bigint,
  bbbbbbb int,
  ccccc int,
  ddddddddd bigint,
  xxxxxxxxxxxxxxxxx timestamp,
  xxxxxxxx text,
  xxxxxxxxxx int,
  xxxxxxxxx int,
  PRIMARY KEY ((aaaaa, bbbbbbb), ccccc, dddddddd)
);

CREATE TABLE IF NOT EXISTS yyyyyyyyyyy
(
  aaaaaaaaaaa         BIGINT,
  bbbbbbbbbbbbbb      INT,
  cccccccccccc        INT,
  ddddddddddd         INT,
  eeeeeeeeeeeee       BIGINT,
  ffffffffffffffffff  INT,
  ggggggggggg         BIGINT,
  xxxx                TIMESTAMP,
  xxxxx               FLOAT,
  xxxxxxxxxxxxxxxxxx  INT,
  xxxxxxxx            BIGINT,
  yyyyy               FLOAT,
  PRIMARY KEY
    (
      (aaaaaaaaaaa, bbbbbbbbbbbbbb),
      cccccccccccc,
      ddddddddddd,
      eeeeeeeeeeeee,
      ffffffffffffffffff,
      ggggggggggg
    )
);

Adam Holmberg

Jun 6, 2016, 5:24:50 PM
to python-dr...@lists.datastax.com
The client treats paging state as an opaque blob, and I haven't found anything in the driver that would impose this limit, so I think you're hitting a server limit. I spent some time looking at paging in your server version. It looks like there is a bug that causes PagingState.remaining to decrement monotonically as pages are consumed, which would explain what you are seeing. I haven't found a JIRA ticket, but the issue does not appear in Cassandra 3.x.
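
To see why a steadily shrinking "remaining" counter would stop a scan at exactly the row count reported above, here is a toy model in Python. It is not Cassandra's code: the function name, the page size, and the assumption that the counter starts at the 32-bit signed maximum are all illustrative.

```python
INT32_MAX = 2**31 - 1  # 2147483647, the line count reported in this thread


def rows_delivered(total_rows, page_size=5000, remaining=INT32_MAX):
    """Toy model: rows returned when a shared 'remaining' budget shrinks per page.

    ASSUMPTION: 'remaining' starts at INT32_MAX and is never reset between
    pages. This only mimics the behavior described above.
    """
    delivered = 0
    while delivered < total_rows and remaining > 0:
        # A page can never exceed the rows left in the table or in the budget.
        page = min(page_size, total_rows - delivered, remaining)
        delivered += page
        remaining -= page
    return delivered
```

Under these assumptions, a table smaller than the budget is returned in full, while a 3-billion-row scan ends once 2147483647 rows have been delivered, whatever the page size.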

If you are not able to upgrade to a working version of the server, another way to circumvent this would be to run multiple queries using the token function to break up the hash space.
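
A minimal sketch of the token-range approach, assuming the cluster uses Murmur3Partitioner (token values from -2**63 to 2**63 - 1). The helper names `token_ranges`, `range_query`, and `export_rows`, plus the table and column names passed to them, are hypothetical. The only driver call used is `session.execute(query, parameters)` with `%s` placeholders.

```python
MIN_TOKEN = -(2**63)   # smallest Murmur3Partitioner token (assumed partitioner)
MAX_TOKEN = 2**63 - 1  # largest Murmur3Partitioner token


def token_ranges(splits):
    """Yield inclusive (start, end) pairs covering the whole ring without overlap."""
    span = (MAX_TOKEN - MIN_TOKEN + 1) // splits
    start = MIN_TOKEN
    for i in range(splits):
        # The last range absorbs any remainder left by integer division.
        end = MAX_TOKEN if i == splits - 1 else start + span - 1
        yield start, end
        start = end + 1


def range_query(table, partition_key_columns, select_columns="*"):
    """Build a SELECT restricted to one token range, with %s placeholders."""
    token_expr = "token({})".format(", ".join(partition_key_columns))
    return "SELECT {} FROM {} WHERE {} >= %s AND {} <= %s".format(
        select_columns, table, token_expr, token_expr)


def export_rows(session, table, partition_key_columns, splits=1000):
    """Run one paged query per token range and yield every row."""
    query = range_query(table, partition_key_columns)
    for start, end in token_ranges(splits):
        for row in session.execute(query, (start, end)):
            yield row
```

The rows yielded by `export_rows` can feed the same batched csv loop as before. Choose `splits` large enough that each range returns well under 2147483647 rows, so no single query reaches the limit.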

Regards,
Adam Holmberg

Alex Baeza

Jun 6, 2016, 8:12:01 PM
to python-dr...@lists.datastax.com
Thanks for the help.

I’ll try re-writing this code to use the token function! :)

Adam Holmberg

Jun 13, 2016, 9:56:40 AM
to python-dr...@lists.datastax.com
I also created this ticket for the server issue: https://issues.apache.org/jira/browse/CASSANDRA-11963