Efficient ways to load data into Cassandra?


Daniel Hong

Jan 7, 2017, 7:10:16 PM
to DataStax Python Driver for Apache Cassandra User Mailing List
I am trying to find out how to efficiently load data into Cassandra from Python, whether it is 1,000 or a million records. I am a beginner to both Cassandra and Python. I did read that using a batch statement might be an option, but it sounds like that would still load records a row at a time. I've also tried transforming my dataset to JSON, but I am not sure whether there is a statement that will bulk load it into Cassandra.
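
For context, a batched insert with the DataStax Python driver looks roughly like the sketch below (a minimal example; the keyspace, table, and column names are made up). A batch only groups prepared statements into a single request, it is not a bulk-load path:

    # Minimal sketch of a batched insert with the DataStax Python driver.
    # "demo", "users", and the columns are hypothetical names.
    from cassandra.cluster import Cluster
    from cassandra.query import BatchStatement

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect('demo')

    insert = session.prepare("INSERT INTO users (id, name) VALUES (?, ?)")
    rows = [(1, 'alice'), (2, 'bob'), (3, 'carol')]

    # The batch goes out as one request, but each statement is still an
    # individual row insert -- this groups writes, it does not bulk-load.
    batch = BatchStatement()
    for row in rows:
        batch.add(insert, row)
    session.execute(batch)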

Another option I thought of is loading the transformed JSON data from Python as one huge text value into a Cassandra staging table, then parsing out the data in CQL and loading the results into the final Cassandra table.

Would anyone be able to provide their thoughts? Much appreciated!

Greg Bestland

Jan 9, 2017, 2:11:46 PM
to python-dr...@lists.datastax.com
Daniel,

For bulk loading data I would point you to the cqlsh COPY function. There has been quite a bit of time spent on optimizing this path.

Linked below is a docs page describing how the cqlsh COPY command works.
http://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlshCopy.html?hl=copy
You don't need JSON, just CSV.
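
As an illustration (not taken from the docs page verbatim), a COPY FROM invocation in cqlsh looks roughly like this, with a made-up keyspace, table, and file name; the docs page above covers the tuning options:

    COPY demo.users (id, name, email) FROM 'users.csv' WITH HEADER = TRUE;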

If possible I'd use C* 3.5 or later, as that version contains optimizations that improve the import path.
It was even mentioned in the keynote at C* Summit this year.


I hope this gets you headed in the right direction.

~Thanks
Greg Bestland.

Daniel Hong

Jan 10, 2017, 2:55:00 PM
to python-dr...@lists.datastax.com
Hey Greg,

Thanks for sharing your thoughts. The COPY function is convenient to have, but I am also interested in knowing whether there are other ways to help with loading performance. Say, for example, I import a million records into Python from a relational DB and then use COPY; I would first need to export those million records to a local file or to the VM, which is an added step that takes time and space. Another thing I just thought of is the possibility of using PySpark. Since Spark is good at processing huge datasets, maybe I can combine it with the Python Cassandra driver?
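
Purely as a sketch of that PySpark idea, writes into Cassandra from Spark typically go through the DataStax spark-cassandra-connector rather than the Python driver itself. Assuming the connector package is on the Spark classpath (e.g. via --packages), and with made-up keyspace/table names, it would look roughly like:

    # Rough sketch: write a Spark DataFrame to Cassandra via the
    # spark-cassandra-connector. Keyspace/table names are hypothetical.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("bulk-load-sketch")
             .config("spark.cassandra.connection.host", "127.0.0.1")
             .getOrCreate())

    # In practice this could be a JDBC read from the relational DB
    # instead of a small in-memory DataFrame.
    df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

    (df.write
       .format("org.apache.spark.sql.cassandra")
       .options(table="users", keyspace="demo")
       .mode("append")
       .save())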
