Comparison of RDD saveToCassandra vs DataFrame write


Shiva Achari

Mar 23, 2017, 6:56:47 AM
to spark-conn...@lists.datastax.com
Hi,

What is the better approach for saving data to a Cassandra table?
  1. repartitionByCassandraReplica followed by saveToCassandra
  2. DataFrame.write.format("org.apache.spark.sql.cassandra").mode(SaveMode.Append)
If there are any benchmark results, please share them.

Thanks and Regards,
Shiva Achari

swati

Mar 24, 2017, 1:21:10 AM
to spark-conn...@lists.datastax.com
Hi Shiva,

I have saved a DataFrame to an Elasticsearch index using the following method. I hope it is helpful for you.

Dataframe.write.format('org.elasticsearch.spark.sql').mode('append').option('es.index.auto.create','true').option('es.resource','index/typei').save()

--
You received this message because you are subscribed to the Google Groups "DataStax Spark Connector for Apache Cassandra" group.
To unsubscribe from this group and stop receiving emails from it, send an email to spark-connector-user+unsub...@lists.datastax.com.




--

Thanks and regards, 

Swati Saini,
Bachelor of Technology 
IIT Kharagpur, 2016

Russell Spitzer

Mar 24, 2017, 1:28:44 AM
to spark-conn...@lists.datastax.com
saveToCassandra and a DataFrame write are essentially the same operation and use the same underlying code. The key difference is that operations done in DataFrames will most likely be more efficient than the same operations in RDDs because of the Catalyst optimizer and the Tungsten row format. That said, some operations, such as joinWithCassandraTable, are not available in DataFrames and require dropping down to RDDs.

My recommendation would be to stick to DataFrames unless you need the expanded utility of joinWithCassandraTable or are relying on some kind of Cassandra-aware partitioning, for example with spanBy.

repartitionByCassandraReplica is an additional shuffle that may or may not benefit your use case; a plain sort on the partition key is probably more efficient. Either way, it is unrelated to the choice between the two save methods above.
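
For reference, the two write paths under discussion look roughly like this in Scala. This is a minimal sketch rather than a benchmark. It assumes Spark 2.x with the Spark Cassandra Connector on the classpath, a node reachable at 127.0.0.1, and an existing table `test.kv` with columns `key` and `value`. The names `KV`, `WritePaths`, and `target` are made up for illustration.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import com.datastax.spark.connector._ // adds saveToCassandra to RDDs

// Hypothetical row type; field names are assumed to match the table's column names.
case class KV(key: String, value: Int)

object WritePaths {
  // Hypothetical target table, assumed to exist already.
  val target = Map("keyspace" -> "test", "table" -> "kv")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("write-paths")
      .config("spark.cassandra.connection.host", "127.0.0.1") // assumed local node
      .getOrCreate()
    import spark.implicits._

    val rows = Seq(KV("a", 1), KV("b", 2))

    // Option 1: RDD API. Case class fields are mapped to columns by name.
    spark.sparkContext.parallelize(rows)
      .saveToCassandra(target("keyspace"), target("table"))

    // Option 2: DataFrame API, writing through the connector's data source.
    rows.toDF().write
      .format("org.apache.spark.sql.cassandra")
      .options(target)
      .mode(SaveMode.Append)
      .save()

    spark.stop()
  }
}
```

Both calls end in the same connector write path, so any performance difference should come from the transformations that run before the write, not from the save itself.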



Russell Spitzer
Software Engineer





vincent gromakowski

Mar 24, 2017, 3:37:08 AM
to spark-conn...@lists.datastax.com
Hi Russell,
Is this still true when you tightly manage custom partitioning end to end with RDD[K, V]? I am not sure DataFrames can preserve partitioning, and I have noticed many more shuffle operations...
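
The partitioning concern can be checked locally without Cassandra. The sketch below assumes Spark 2.x running in local mode. `PartitioningCheck` and `flags` are hypothetical names, and a `HashPartitioner` stands in for whatever custom partitioner is in use.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

object PartitioningCheck {
  // Returns (partitioner known on the pair RDD,
  //          partitioner known after a round trip through a DataFrame).
  def flags(spark: SparkSession): (Boolean, Boolean) = {
    import spark.implicits._

    // Key-value RDD with an explicit partitioner attached.
    val pairs = spark.sparkContext
      .parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
      .partitionBy(new HashPartitioner(4))

    // Convert to a DataFrame and back, then inspect the partitioner again.
    val roundTrip = pairs.toDF("key", "value").rdd

    (pairs.partitioner.isDefined, roundTrip.partitioner.isDefined)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partitioning-check")
      .master("local[2]") // assumption: local run for illustration only
      .getOrCreate()
    println(flags(spark))
    spark.stop()
  }
}
```

On Spark 2.x this is expected to print `(true,false)`: the RDD obtained from a DataFrame does not expose a `Partitioner`, so later key-based RDD operations cannot reuse the earlier layout and will shuffle again.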


Goutham Dasari

Mar 24, 2017, 7:06:06 AM
to DataStax Spark Connector for Apache Cassandra
Sorry to jump in. Is there a Scala version of saving DataFrame data to a Cassandra table?

I have the result of a SQL query run on a Cassandra table in a DataFrame. I would like to load it back into a new table in Cassandra. I am using Scala in spark-shell.

Thanks,
Goutham

Russell Spitzer

Mar 24, 2017, 12:52:23 PM
to DataStax Spark Connector for Apache Cassandra
@Vincent, you lose partitioning completely in DataFrames, but this isn't necessarily an issue if you have to do multiple shuffles anyway :)

@Goutham, yes, it's in the docs:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md
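
As a rough illustration of the flow Goutham describes (query a Cassandra table with Spark SQL, then write the result to another table), a Scala sketch might look like the following. It assumes Spark 2.x and the connector on the classpath. The keyspace `test`, the tables `players` and `top_players`, and the columns `name` and `score` are hypothetical, and the target table is assumed to already exist with matching columns.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object SqlResultToCassandra {
  // Builds the connector's data source options for a keyspace/table pair.
  def cassandraOptions(keyspace: String, table: String): Map[String, String] =
    Map("keyspace" -> keyspace, "table" -> table)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-result-to-cassandra")
      .config("spark.cassandra.connection.host", "127.0.0.1") // assumed local node
      .getOrCreate()

    // Load the source table and expose it to Spark SQL under a temporary name.
    spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(cassandraOptions("test", "players"))
      .load()
      .createOrReplaceTempView("players")

    // The SQL result is an ordinary DataFrame.
    val result: DataFrame =
      spark.sql("SELECT name, score FROM players WHERE score > 10")

    // Append the result to the (pre-created, hypothetical) target table.
    result.write
      .format("org.apache.spark.sql.cassandra")
      .options(cassandraOptions("test", "top_players"))
      .mode(SaveMode.Append)
      .save()

    spark.stop()
  }
}
```

In spark-shell the `spark` session already exists, so only the read, query, and write statements are needed. The documentation page linked above also covers creating the target table from a DataFrame schema; check it against your connector version.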
