For the above use case, my initial thought was to submit Job1 and Job2 from independent drivers (they use two different RDDs and do not depend on each other), wait for the Job1 result (via an action on rdd1) to come back to the driver, wait for the Job2 result to come back as well, and then feed the merged result to Job3 as input.
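To make that flow concrete, here is a rough sketch in plain Python rather than actual Spark code. The functions job1, job2 and job3 and the helper run_pipeline are hypothetical placeholders (in the real setup each job would end in an RDD action), and a thread pool stands in for submitting the two jobs independently.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the real jobs; in Spark each of these
# would be a chain of transformations ending in an action.
def job1(data):
    return [x * 2 for x in data]

def job2(data):
    return [x + 1 for x in data]

def job3(merged):
    return sum(merged)

def run_pipeline(data1, data2):
    # Submit Job1 and Job2 concurrently, since they are independent.
    with ThreadPoolExecutor(max_workers=2) as pool:
        future1 = pool.submit(job1, data1)
        future2 = pool.submit(job2, data2)
        # Block until both results are back on the "driver" side.
        result1 = future1.result()
        result2 = future2.result()
    # Merge the two results and feed them to Job3 as input.
    merged = result1 + result2
    return job3(merged)
```

The round trip I would like to avoid is the step where both results are collected before Job3 is sent back out.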
Are there any suggestions for alternate execution steps that would avoid coming back to the driver and then sending Job3 back out to the cluster?
My alternate design was to write the output of Job1 and Job2 to a NoSQL store such as Cassandra or MongoDB, so that their results are persisted and the dependency on a single node (the driver) is minimized.
I am still trying to work out a design that gives the best latency and is also fairly fault tolerant.
I would appreciate any feedback.
Thanks
Sijo