How to convert an Array into an RDD in Scala


John Miller

unread,
Jan 24, 2017, 1:39:27 PM1/24/17
to spark-conn...@lists.datastax.com
All,


I'm trying to save a Spark DataFrame to a Cassandra table. I'm able to collect the DataFrame into an Array, but I don't know how to convert the Array into an RDD of quoted tuples. Can someone please help me with the below?

My Array looks like this:
=======================

df2: Array[Array[String]] = Array(Array(00000000140000003414, I), Array(00000000140000003583, I), Array(00000000140000003900, U), Array(00000000140000004042, D), Array(00000000140000004194, I))

I want to change the above into this:
========================
sc.parallelize(Seq(("00000000140000003414", "I"), ("00000000140000003583", "I"), ("00000000140000003900", "U"), ("00000000140000004042", "D"), ("00000000140000004194", "I")))

Thanks,
John

Eric Meisel

unread,
Jan 24, 2017, 1:45:12 PM1/24/17
to spark-conn...@lists.datastax.com
This is more of a general Spark question, but you should consider a direct translation between your DataFrame and an RDD. If you first cast the DataFrame into an Array (I assume you are using .collect()?), you pull all of that data into the driver, losing the power of distributed processing.

You can get an RDD[Row] from a DataFrame by using the provided .rdd method. You can also call .map() on the DataFrame and map each Row object to the values that you expect.
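For instance, a minimal sketch of that route (assuming a DataFrame df whose first two columns are the strings pos and op_type; the "test" keyspace and "df_test" table names are placeholders, not from your setup):

```scala
import com.datastax.spark.connector._  // brings saveToCassandra into scope

// Stay distributed: map each Row straight to a (pos, op_type) tuple
// instead of collecting to the driver first.
val pairs = df.rdd.map(row => (row.getString(0), row.getString(1)))

// Keyspace/table names here are assumptions - substitute your own.
pairs.saveToCassandra("test", "df_test", SomeColumns("pos", "op_type"))
```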


--
You received this message because you are subscribed to the Google Groups "DataStax Spark Connector for Apache Cassandra" group.
To unsubscribe from this group and stop receiving emails from it, send an email to spark-connector-...@lists.datastax.com.

Eric Meisel

unread,
Jan 24, 2017, 1:46:42 PM1/24/17
to spark-conn...@lists.datastax.com
To perform the conversion that you're looking to do (using your variable names):

df2.toSeq.map( (x: Array[_]) => x.map(_.toString) )

Eric Meisel

unread,
Jan 24, 2017, 1:52:26 PM1/24/17
to spark-conn...@lists.datastax.com
Sorry - I see that you are trying to merge the arrays, and that the elements are already strings. You can use the .flatten operation for this:

df2.toSeq.flatten
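One caveat: .flatten yields a flat Seq[String] with the pos and op_type values interleaved, not the (pos, op_type) pairs shown earlier in the thread. To keep the pairing, one option is to pattern-match each two-element inner array - a plain-Scala sketch, using a couple of hypothetical rows mirroring the df2 output above:

```scala
// Hypothetical sample with the same Array[Array[String]] shape as df2.
val df2 = Array(
  Array("00000000140000003414", "I"),
  Array("00000000140000003583", "I")
)

// Turn each two-element inner array into a (pos, op_type) tuple.
val pairs = df2.toSeq.map { case Array(pos, opType) => (pos, opType) }
// pairs is a Seq[(String, String)], ready for sc.parallelize(pairs).
```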

John Miller

unread,
Jan 24, 2017, 2:10:00 PM1/24/17
to spark-conn...@lists.datastax.com
Thank you for the prompt response, Eric. I'm new to Scala and Spark. Here is what I'm trying to do with my JSON file.


{"table":"OGG_BDATA.BTEST","op_type":"I","op_ts":"2017-01-15 20:27:41.054429","current_ts":"2017-01-15T14:29:11.069000","pos":"00000000140000003414","after":{"ID":"1","CDATE":"2017-01-15:14:27:42","NAME":"aabbcc","C_FILE":"ddffgg"}}
{"table":"OGG_BDATA.BTEST","op_type":"I","op_ts":"2017-01-15 20:28:48.054429","current_ts":"2017-01-15T14:29:11.410000","pos":"00000000140000003583","after":{"ID":"2","CDATE":"2017-01-15:14:28:50","NAME":"aabbcc","C_FILE":"ddffgg"}}
{"table":"OGG_BDATA.BTEST","op_type":"U","op_ts":"2017-01-15 20:32:44.054377","current_ts":"2017-01-15T14:32:48.533000","pos":"00000000140000003900","before":{"ID":"2","CDATE":"2017-01-15:14:28:50","NAME":"aabbcc"},"after":{"ID":"2","CDATE":"2017-01-15:14:28:50","NAME":"aabbcc","C_FILE":"rreeff"}}
{"table":"OGG_BDATA.BTEST","op_type":"D","op_ts":"2017-01-15 20:33:58.054432","current_ts":"2017-01-15T14:34:03.579000","pos":"00000000140000004042","before":{"ID":"2","CDATE":"2017-01-15:14:28:50","NAME":"aabbcc"}}
{"table":"OGG_BDATA.BTEST","op_type":"I","op_ts":"2017-01-16 02:58:58.054389","current_ts":"2017-01-15T20:59:03.334000","pos":"00000000140000004194","after":{"ID":"3","CDATE":"2017-01-15:20:59:01","NAME":"aabbcc","C_FILE":"ddffgg"}}


import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf(true).set("spark.cassandra.connection.host","10.0.0.203").set("spark.driver.allowMultipleContexts", "true")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.json("file:///home/edureka/spark-1.5.2/jason.data")
val df1 = df.select(df("pos"), df("op_type"))
val df2 = df1.map { _.toSeq.map(_.toString).toArray }.collect()

----> Can you help me here? How do I read that into an RDD and save it to the Cassandra table? (In this case, I should get 5 rows in the Cassandra table.)

My Cassandra table looks like this:

select * from df_test;

 pos                  | op_type
----------------------+---------
 00000000140000003900 | U



Eric Meisel

unread,
Jan 24, 2017, 2:38:27 PM1/24/17
to spark-conn...@lists.datastax.com
Well, you don't need the saveToCassandra call, so there's no need to use RDDs at all - you can just save the DataFrame itself:

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md#persisting-a-dataframe-to-cassandra-using-the-save-command
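The save-command approach from that doc, sketched against the df1 from earlier in the thread (the "test" keyspace name is an assumption - use your own):

```scala
import org.apache.spark.sql.SaveMode

// df1 already holds just the "pos" and "op_type" columns, so it can be
// written directly; the connector keeps the write distributed.
df1.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "df_test", "keyspace" -> "test"))
  .mode(SaveMode.Append)
  .save()
```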




John Miller

unread,
Jan 24, 2017, 4:17:08 PM1/24/17
to spark-conn...@lists.datastax.com
Thank you, Eric. Let me try it.

