How to sink from a Kafka topic to a CDAP dataset via Spark Structured Streaming

Omar Meza

Oct 13, 2018, 5:00:27 PM

to CDAP User

Hi!
Is it possible to ingest data from a Kafka topic into a CDAP dataset table via Spark Structured Streaming? Is there a sample?

Thank you!
Omar

Sanjay

Oct 16, 2018, 11:51:20 PM

to CDAP User

I asked a similar question here: https://groups.google.com/d/msg/cdap-user/85PTHsnptmY/dGgvOTSMBQAJ

According to the answer on that page, it seems the answer is no.

However, my question to the CDAP team is this. Isn't it possible to:

1) Write a Kafka streaming source plugin that uses the Structured Streaming library (using code like this: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-streaming-queries )

2) Have it return a Dataset, which I can convert to an RDD

3) Write a SparkCompute component that can work on this RDD

Or maybe I am oversimplifying things and there are a lot of complications that I don't foresee?
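
For reference, here is a minimal PySpark sketch of what steps 1 to 3 might look like outside of a CDAP plugin. It rests on several assumptions that are not from this thread: Spark 2.4 or later with the spark-sql-kafka-0-10 package available, and hypothetical helper names (kafka_options, read_kafka_stream, start_query). As far as I know, calling .rdd directly on a streaming DataFrame is not supported, so the sketch uses foreachBatch to receive each micro-batch as an ordinary DataFrame and converts that to an RDD instead.

```python
# Sketch only: assumes Spark 2.4+ with the spark-sql-kafka-0-10 package available.
# All function names here are hypothetical, chosen for illustration.

def kafka_options(bootstrap_servers, topic):
    """Options for Spark's built-in "kafka" Structured Streaming source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def read_kafka_stream(spark, bootstrap_servers, topic):
    """Step 1: create a streaming DataFrame from a Kafka topic."""
    return (
        spark.readStream
        .format("kafka")
        .options(**kafka_options(bootstrap_servers, topic))
        .load()
        # The Kafka source delivers key and value as binary; cast to strings.
        .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
    )


def start_query(stream_df, handle_rdd):
    """Steps 2 and 3: pass each micro-batch to handle_rdd as an RDD of Rows."""
    def on_batch(batch_df, batch_id):
        # batch_df is a regular (non-streaming) DataFrame, so .rdd is allowed.
        handle_rdd(batch_df.rdd, batch_id)

    return stream_df.writeStream.foreachBatch(on_batch).start()
```

With a real SparkSession named spark, this would be driven by something like start_query(read_kafka_stream(spark, "localhost:9092", "events"), my_handler).awaitTermination(), where the broker address, the topic name, and my_handler are placeholders. The sketch stops at the RDD; writing it to a CDAP dataset is not covered here.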

Regards,

Sanjay

Albert Shau

Oct 17, 2018, 1:13:40 PM

to cdap...@googlegroups.com

Hi,

You are free to do whatever you want in a custom CDAP application, including reading from Kafka and writing to a dataset using Spark Streaming. This is just like any other Spark program, except that you need to use the SparkExecutionContext provided by CDAP and call its saveAsDataset(). You can see an example of writing to a dataset at https://github.com/caskdata/cdap/blob/release/5.1/cdap-examples/SparkPageRank/src/main/java/co/cask/cdap/examples/sparkpagerank/SparkPageRankProgram.java#L148.

One limitation is that the sandbox is packaged with Spark 2.1, so you won't be able to develop locally against newer Spark features. In distributed mode you'll be able to use whatever Spark version is on the cluster.

If you are looking for a way to run a pipeline using Structured Streaming, then the conversation mentioned by Sanjay applies and you'll have to either wait for us to prioritize and implement it, or contribute it yourself :). Two of the complications in that work are that we still support older versions of Spark that don't have Structured Streaming, and that we would need to decide how to handle unstructured data.

Best,

Albert
