Spark Streaming: Fault-tolerant join on a sliding window

Ilya Katsov

Aug 7, 2013, 7:40:26 AM
to spark...@googlegroups.com, us...@spark.incubator.apache.org
Hi All,

We are building a system that consumes two streams of events and joins these streams on a time window of about 30 minutes (roughly 100GB of retained data). The logic of the join operation is as follows (a rough sketch of the intended semantics follows the list):
- events from the first stream are kept in a buffer for 30 minutes
- events from the second stream are joined with the buffer on some key, and the joined records are emitted downstream
- if an event from the first stream has not been joined with at least one event from the second stream within 30 minutes, it is emitted downstream and evicted from the buffer
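
To make the intended semantics concrete, here is a rough single-machine sketch of the buffer logic (the Event type and the emit/emitUnjoined sinks are just placeholders to keep the sketch self-contained, not from any real API):

object JoinSketch {
  // Placeholder event type and output sinks, purely for illustration.
  case class Event(key: String, payload: String)
  def emit(first: Event, second: Event): Unit = println("joined: " + first + " + " + second)
  def emitUnjoined(first: Event): Unit = println("unjoined: " + first)

  case class BufEntry(event: Event, arrivedAt: Long, var joined: Boolean = false)

  // First-stream events, keyed, waiting up to 30 minutes for a match.
  val buffer = scala.collection.mutable.Map.empty[String, BufEntry]

  def onFirstEvent(e: Event): Unit =
    buffer(e.key) = BufEntry(e, System.currentTimeMillis)

  def onSecondEvent(e: Event): Unit =
    buffer.get(e.key).foreach { entry =>
      entry.joined = true
      emit(entry.event, e)          // joined record goes downstream immediately
    }

  def evictExpired(): Unit =        // run periodically
    buffer.retain { (_, entry) =>
      val expired = System.currentTimeMillis - entry.arrivedAt > 30 * 60 * 1000L
      if (expired && !entry.joined) emitUnjoined(entry.event)   // never joined: emit alone
      !expired                                                  // drop expired entries
    }
}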

The implementation should guarantee fault-tolerance and the absence of lost or duplicated events at the output. Is it possible to implement such a pipeline using Spark? Is Spark able to provide fault-tolerance in this case, and how does it work?

Thanks,
Ilya

Tathagata Das

Aug 9, 2013, 1:12:36 AM
to spark...@googlegroups.com, us...@spark.incubator.apache.org
Hi Ilya, 

I think this can be done in the following way. First, apply the window operation on the first event stream to create a new windowed stream. Then this windowed stream is joined against the second event stream to create another stream, which will have all the joined data. The code will look something like this, probably with a few more maps and other transformations as well.

val windowedEventStream = firstEventStream.window(Minutes(30))   // Each RDD of this DStream will be the unified data of the last 30 minutes.

val joinedStream = secondEventStream.join(windowedEventStream)

joinedStream.foreach( joinedRDD => {
   // do something with the joined RDD, maybe push it to a database, etc.
})


Underneath, since all the data transformations are executed as deterministic RDD transformations, you automatically get exactly-once guarantees for all transformations (map, join, reduce, filter, etc.) despite worker failures. For output operations like foreach (shown above), the operation may get executed more than once. Unlike the deterministic, functional (i.e. external side-effect free) transformations like map and reduce, output operations are not side-effect free, as arbitrary computation can be done with the data. So it is up to the user to output the data in such a way that ensures exactly-once semantics. For example, if you are updating a database, you could design it such that updates are idempotent - updating the same record twice with the same data does not make a difference.
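
For instance, a minimal sketch of one way to do this (recordId and saveToStore below are hypothetical placeholders for your own id scheme and database client, not a Spark or library API):

joinedStream.foreach( joinedRDD => {
   joinedRDD.foreach { case (key, (secondEvent, firstEvent)) =>
     // Derive a deterministic id from the record contents (recordId is a placeholder),
     // so that if this output operation is re-executed after a failure, the same id is
     // written again - an overwrite/upsert - rather than a new duplicate row inserted.
     val id = recordId(key, firstEvent, secondEvent)
     saveToStore(id, (key, firstEvent, secondEvent))   // hypothetical idempotent upsert
   }
})

If your records already carry a natural unique id and your store does overwrites on that key, no extra id derivation is needed - the writes are idempotent as they are.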

For a more detailed discussion on fault-tolerance properties, take a look at http://spark-project.org/docs/latest/streaming-programming-guide.html#fault-tolerance-properties
Let me know if you need more clarifications. 

TD


