I'm trying to do some ETL on a large Cassandra table.
First I do:
data = session.execute(query='select ... from table', timeout=None)
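For context, here is a minimal sketch of roughly how that session is set up and the query run (the contact point and keyspace below are placeholders, not my real values):

from cassandra.cluster import Cluster

# Placeholder contact point and keyspace
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')

# For large results the driver pages through the rows as they are iterated
data = session.execute('select ... from table', timeout=None)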
Then I write the data to a csv:
row_count = 0
row_batch = []
# Open in text mode ('wt') so csv.writer gets strings rather than bytes
with gzip.open(file_name, 'wt', newline='') as csv_file:
    writer = csv.writer(csv_file)
    # Write to compressed csv in batches of 5000
    for row in data:
        row_count += 1
        row_batch.append(row)
        if row_count % 5000 == 0:
            writer.writerows(row_batch)
            logger.debug('wrote row {} to {}'.format(row_count, file_name))
            row_batch = []
    # Write remaining rows if any
    if row_batch:
        writer.writerows(row_batch)
Next, I read the csv, do some transformations, and then insert the results into another table on another cluster.
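Roughly, that step looks like this (the destination contact point, keyspace, table, columns, and the transform() helper are placeholders for illustration):

import csv
import gzip

from cassandra.cluster import Cluster

# Connect to the destination cluster (placeholder contact point / keyspace)
dest_cluster = Cluster(['10.0.0.1'])
dest_session = dest_cluster.connect('dest_keyspace')

# Prepared insert for the target table (placeholder table and columns)
insert_stmt = dest_session.prepare(
    'INSERT INTO dest_table (id, value) VALUES (?, ?)')

with gzip.open(file_name, 'rt', newline='') as csv_file:
    reader = csv.reader(csv_file)
    for row in reader:
        # transform() stands in for the real transformation; it returns
        # the bind values for the prepared insert
        dest_session.execute(insert_stmt, transform(row))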
The problem I keep seeing is that whenever my source table query is large (more than 2 billion rows), the line count of the csv file always comes out to exactly the maximum 32-bit signed integer (2147483647), so I assume I'm hitting some kind of session.execute limit?
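For reference, here is a simple way to check that count (just a straight read-back of the gzipped file):

import gzip

with gzip.open(file_name, 'rt') as csv_file:
    line_count = sum(1 for _ in csv_file)
# on the large exports this always comes out to 2147483647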
I'm using the Cassandra driver version 2.6.0.c1.
Any help or suggestions would be much appreciated.
Alternatively, if anyone has other suggestions on how to export more than 2 billion rows to a csv for processing, that would be helpful as well.