Dependent Jobs


Sijo

May 21, 2013, 10:56:52 AM

to spark...@googlegroups.com

Hi all,

For a use case involving dependent jobs, I am planning the following implementation:

Job1 runs my matcher1 on dataset1, and Job2 runs matcher2 on dataset2. I need to fuse both results to create dataset3 for Job3.

Is there an available primitive or API for this kind of wait-for-both-jobs paradigm?

To keep the overall latency low, I was thinking of writing the results of Job1 and Job2 to a Cassandra table, waiting for both results, processing them to build dataset3 in Cassandra, and then feeding that to Job3 using hadoopRDD (via an InputFormat).
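The "wait for results" step of this store-based design could look roughly like the sketch below. It is only an illustration: a plain Python dict stands in for the Cassandra table, and the function name `wait_for_results`, the job ids, and the timeout values are all hypothetical, not part of any Spark or Cassandra API.

```python
import time


def wait_for_results(store, job_ids, timeout_s=5.0, poll_s=0.05):
    """Poll `store` until every job id has a result, or raise TimeoutError.

    `store` is a hypothetical stand-in for the results table: any mapping
    from job id to that job's output works here.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        # Collect the jobs that have not written their output yet.
        missing = [job_id for job_id in job_ids if job_id not in store]
        if not missing:
            # Return results in the same order as the requested job ids.
            return [store[job_id] for job_id in job_ids]
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no results for: {missing}")
        time.sleep(poll_s)
```

With a real table, the membership check would become a query, and the timeout gives the caller a way to notice when one of the two jobs has failed instead of waiting forever.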

Any pointers are welcome.

Many thanks,

Sijo

Sijo Cherian

May 23, 2013, 7:11:35 PM

to spark...@googlegroups.com

For the above use case, my initial thoughts were: submit Job1 and Job2 from independent drivers (since they use two different RDDs and are independent of each other), wait for Job1 (using an action on rdd1) to return its results to the driver, wait for Job2's result (also returned to the driver), and then feed the merged result as input to Job3.
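The steps above (start both jobs, wait for both, merge, run Job3) can be sketched with plain Python futures. This is a minimal sketch under stated assumptions, not Spark code: `run_job1`, `run_job2`, `fuse`, and `run_job3` are hypothetical placeholders, the "keep keys present in both results" fusion rule is invented for illustration, and it assumes both jobs can be launched from one process on separate threads, which this thread does not confirm for Spark.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for the two independent jobs. In a real setup each
# would trigger a cluster action (for example, collecting rdd1 or rdd2).
def run_job1(dataset1):
    return {key: ("matcher1", value) for key, value in dataset1.items()}


def run_job2(dataset2):
    return {key: ("matcher2", value) for key, value in dataset2.items()}


def fuse(result1, result2):
    """Build dataset3. Assumed rule: keep keys present in both results."""
    shared = result1.keys() & result2.keys()
    return {key: (result1[key], result2[key]) for key in shared}


def run_job3(dataset3):
    # Placeholder for the third job: return the fused keys in sorted order.
    return sorted(dataset3)


def run_pipeline(dataset1, dataset2):
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Both jobs start immediately and run concurrently.
        future1 = pool.submit(run_job1, dataset1)
        future2 = pool.submit(run_job2, dataset2)
        # result() blocks until that job finishes and re-raises its exception.
        result1 = future1.result()
        result2 = future2.result()
    return run_job3(fuse(result1, result2))
```

For example, `run_pipeline({"a": 1, "b": 2}, {"b": 3, "c": 4})` fuses on the single shared key `"b"`. Note that this pattern still brings both results back to the calling process before Job3 starts, which is the round trip the question asks how to avoid.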

Are there any suggestions for alternative execution steps that would avoid coming back to the driver and sending Job3 back out to the cluster?

My alternative design was to write the output to a NoSQL store such as Cassandra or MongoDB, so that the results of Job1 and Job2 are persisted and the dependency on a single node (the driver) is minimized.

I am still working out a design with the best latency that is also fairly fault tolerant.

I would appreciate any feedback.

Thanks

Sijo
