support for structured streaming

sanja...@gmail.com

Jul 17, 2018, 9:21:59 AM
to CDAP User
While CDAP supports spark streaming, I was more interested in knowing when it would support structured streaming.

There are two aspects to this:
1) Spark version: CDAP 4.3.4 supports Spark 2.1, whereas structured streaming is stable from Spark 2.2 onwards. What is the approximate time frame for CDAP to start supporting Spark 2.2?
2) Interface between plugins: CDAP's interface states that plugins communicate in terms of RDD[StructuredRecord]. However, structured streaming works in terms of DataFrames. While a DataFrame can be created from an RDD, it is inefficient to do so in every plugin: create a DataFrame from the RDD, do the processing, convert it back to an RDD, only to repeat all of this in the next plugin.
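
To make the overhead concrete, here is a minimal sketch of the round trip each stage would pay (the stage/transform shape here is illustrative, not an actual CDAP API):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.StructType

// One pipeline stage: lift the incoming RDD to a DataFrame, transform it,
// then flatten back to an RDD so the next (RDD-based) plugin can consume it.
// Repeating createDataFrame/.rdd at every stage is the cost in question.
def runStage(spark: SparkSession,
             input: RDD[Row],
             schema: StructType,
             transform: DataFrame => DataFrame): RDD[Row] = {
  val df = spark.createDataFrame(input, schema) // RDD -> DataFrame (re-applies schema)
  transform(df).rdd                             // DataFrame -> RDD for the next stage
}
```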

Has anyone succeeded in writing structured streaming components in CDAP?

Regards,
Sanjay

Albert Shau

Jul 30, 2018, 8:53:00 PM
to cdap...@googlegroups.com
Hi Sanjay,

The CDAP sandbox uses Spark 2.1 during execution, but CDAP distributed already supports Spark 2.2. You are free to use structured streaming in any custom CDAP application.

Streaming pipelines are implemented using Spark streaming, but the intent is to hide that detail from users. If/when we move the implementation to structured streaming, that is a detail that shouldn't affect the plugin APIs. That said, we probably can't use structured streaming for pipelines that don't enforce a structure -- that is, when there isn't a defined schema at every pipeline stage.

Most plugins are abstractions that don't mention Spark at all, though there are a few that do operate on Spark constructs (SparkCompute, SparkSink), and those operate on RDDs as you said. Exposing APIs where the user is given a DataFrame instead of an RDD is definitely something we have talked about and want to do, but it has to be prioritized along with the long list of other desired features (https://issues.cask.co/browse/CDAP-7816).
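
To illustrate the idea being discussed, a purely hypothetical sketch of what a DataFrame-based compute interface could look like (none of these names are real CDAP APIs):

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical: a plugin author would implement this instead of an RDD-based
// SparkCompute, and the framework would own any RDD <-> DataFrame conversion.
trait DataFrameCompute {
  def transform(input: DataFrame): DataFrame
}

// Example plugin: keep only records whose "body" column is non-null.
class FilterNullBodies extends DataFrameCompute {
  override def transform(input: DataFrame): DataFrame =
    input.filter(input("body").isNotNull)
}
```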

Some plugins do exist today that convert RDDs to DataFrames (see https://github.com/data-integrations/dynamic-spark), but it would obviously be better if plugins didn't have to do that conversion themselves.

Best,
Albert

--
You received this message because you are subscribed to the Google Groups "CDAP User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cdap-user+...@googlegroups.com.
To post to this group, send email to cdap...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cdap-user/189d9a6b-1873-49f6-8e20-ccf0c5502b8a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

sanja...@gmail.com

Aug 3, 2018, 7:28:18 AM
to CDAP User
Thanks Albert for the reply.

I do not want to write a CDAP application.
Would it be possible, in a CDAP pipeline, to create a Kafka source plugin using structured streaming directives? I understand that we can convert the DataFrame to an RDD and connect it with subsequent components.

Regards,
Sanjay

Albert Shau

Aug 3, 2018, 12:57:29 PM
to cdap...@googlegroups.com
Hi Sanjay,

What do you mean by structured streaming directives? When you build a pipeline, you don't write any Spark code -- how it runs internally is hidden from you. You can already read from Kafka in batch or streaming pipelines.

Regards,
Albert

sanja...@gmail.com

Aug 4, 2018, 4:29:05 AM
to CDAP User
Hi Albert,

I meant the operations below, which are different from classic streaming:

Source: Structured Streaming Kafka
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()

Processing: Structured Streaming windowing with watermark
// Group the data by window and word and compute the count of each group
val windowedCounts = df
    .withWatermark("timestamp", "10 minutes")
    .groupBy(
        window($"timestamp", "10 minutes", "5 minutes"),
        $"word")
    .count()

Regards,
Sanjay

Albert Shau

Aug 6, 2018, 12:48:45 PM
to cdap...@googlegroups.com
If you are writing Spark plugins, like a streaming source plugin, you will not be able to use the structured streaming sources, since their APIs are incompatible with those of normal streaming.

From a functionality perspective, most of what you can do in structured streaming is already possible in a streaming pipeline using existing sources. Though the Spark code to read from Kafka is different, that doesn't matter to somebody who is only building a pipeline.

Regards,
Albert

sanja...@gmail.com

Aug 9, 2018, 12:49:35 AM
to CDAP User
Hi Albert,

There are a few things that are important to our use case that can be solved using structured streaming:
1) Support for event-time windowing. Classic streaming windows on processing (arrival) time instead.
2) Support for watermarking to handle late-arriving data.
3) An exactly-once end-to-end guarantee.

These things are trivial to implement in structured streaming.
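
For contrast, a sketch of the closest classic-streaming (DStream) counterpart to the windowed count above; it buckets records by arrival time, with no event-time column and no watermark (the `words` stream is assumed to come from some receiver, e.g. a Kafka consumer):

```scala
import org.apache.spark.streaming.Minutes
import org.apache.spark.streaming.dstream.DStream

// Assumed input: words: DStream[String] produced by a receiver.
def windowedCounts(words: DStream[String]): DStream[(String, Long)] =
  words
    .map(w => (w, 1L))
    // 10-minute window sliding every 5 minutes, keyed on arrival time:
    // a late record lands in whatever window is open when it arrives,
    // which is exactly the limitation event-time windowing removes.
    .reduceByKeyAndWindow(_ + _, Minutes(10), Minutes(5))
```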

As per the Spark documentation and other Spark blogs, it's recommended to use Structured Streaming for any new development.
It would be great if CDAP could support it as well.

Regards,
Sanjay