Feel free to use both the Flink and Tez work as a starting point.
I’m happy to offer suggestions on the project as well.
Just a couple of things to consider…
Spark is imperative, in the sense that jobs fire as side effects of building the workload. For example, calling an action like saveAsHadoopFile (I may not have the exact signature right) will immediately fire off the job. That's great for a REPL, but challenging when building a declarative layer over the imperative one.
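To make that concrete, here's a minimal Scala sketch of the lazy/eager split (the app name and the input/output paths are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object EagerActionExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("eager-action").setMaster("local[*]"))

    // Transformations are lazy; nothing has run yet, we've only built the graph.
    val counts = sc.textFile("input.txt")
      .flatMap(_.split("\\s+"))
      .map((_, 1L))
      .reduceByKey(_ + _)

    // The action fires the job immediately, as a side effect of the call.
    // A declarative planner layered on top must control *when* such calls
    // happen, not just what the resulting graph looks like.
    counts.saveAsTextFile("output")

    sc.stop()
  }
}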
Consequently, the order in which those imperative calls are made can affect performance. The Cascading planner will allow for rules that attach job/step scheduling order as metadata for the Cascading scheduler/orchestrator (we might need a patch for this in Cascading; no other platform needs it).
Also, because of this sensitivity to ordering, there will need to be rules that mark intermediate RDDs or Stages as cached (see the sketch below). Note that much of Spark is really about overcoming the limitations of being both imperative and restricted to directed in-tree graphs for jobs, that is, not supporting forks within a job (I haven't looked at Spark 2.0; hopefully they spent their time lifting this limitation instead of providing yet another API).
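For example, here's a minimal sketch of what such a caching rule would have to produce at a fork point; the Event parser and the paths are hypothetical, only the cache() placement matters:

import org.apache.spark.{SparkConf, SparkContext}

object ForkCachingExample {
  case class Event(user: String, host: String)

  // Hypothetical tab-separated parser, for illustration only.
  def parseEvent(line: String): Event = {
    val fields = line.split('\t')
    Event(fields(0), fields(1))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("fork-cache").setMaster("local[*]"))

    // The logical plan forks at `parsed`; caching it keeps the second
    // action from re-reading and re-parsing the entire input.
    val parsed = sc.textFile("events.log").map(parseEvent).cache()

    val byUser = parsed.map(e => (e.user, 1L)).reduceByKey(_ + _)
    val byHost = parsed.map(e => (e.host, 1L)).reduceByKey(_ + _)

    byUser.saveAsTextFile("by-user") // first action materializes the cache
    byHost.saveAsTextFile("by-host") // second action reuses it

    sc.stop()
  }
}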
The good news is you probably only need to create two custom RDDs, one that parallelizes on splits and one on hash partitions, to run the associated pipeline (once you dig into the planner, this will make a bit more sense). And then probably RDDs to wrap Tap instances (this is where I'm still a bit fuzzy), or calls out to equivalent existing RDDs.
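As a rough illustration of the shape of such a custom RDD, here's a sketch of a source RDD that parallelizes on splits. The `runPipelineOnSplit` callback is a stand-in for whatever the planner hands each task; it is not a real Cascading or Spark API:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// One Spark partition per input split.
class SplitPartition(val index: Int) extends Partition

// Hypothetical RDD that runs a planned pipeline over each split.
class PipelineRDD[T: ClassTag](
    sc: SparkContext,
    numSplits: Int,
    runPipelineOnSplit: (Int, TaskContext) => Iterator[T] // assumed callback
) extends RDD[T](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSplits)(i => new SplitPartition(i))

  // Each task executes the pipeline over its own split and streams results.
  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    runPipelineOnSplit(split.index, context)
}

The hash-partition variant would look similar, but would take its parent RDD's partitioner-driven partitions instead of synthesizing them from split counts.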
These might be of help:
https://github.com/cwensel/notebook/blob/public/cluster-computing.adoc
https://github.com/cwensel/notebook/blob/public/cluster-technologies.adoc
I still need to update the table for Spark 1.x and 2.x — unless I get a PR for it.
ckw