Thanks Pascal, interesting work.
Good point about serialisation. I think I can get around that by creating and running the Process on the worker rather than on the driver, which would otherwise pull it into the closure and require serialising it.
Spark Streaming isn't a good fit for what I'm doing, unfortunately, since it assumes real-time input, and I already have all the data, so I need to process it much faster than real time. The application is machine learning on complex timeseries/streams at scale.
This also means I need to parallelise the input processing itself, so streaming the data into Spark through a single point would be a bottleneck (see my response to Pavel).
More concrete example:
val files = Seq(file1, file2, ...)
val rdd = sc.parallelize(files).flatMap(the_thing_I_want_to_do)
rdd should then contain the contents of all the files, processed by my scalaz-stream Process.
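Fleshed out a bit, here's a rough sketch of what I have in mind (the paths and the toUpperCase step are just placeholders for my real files and transform, and it assumes the workers can read the files from a shared filesystem). The key point is that the Process is constructed and run inside the flatMap closure, so nothing scalaz-stream related has to be serialised from the driver:

import org.apache.spark.{SparkConf, SparkContext}
import scalaz.concurrent.Task
import scalaz.stream.{io, Process}

val sc = new SparkContext(new SparkConf().setAppName("process-files"))

// Placeholder paths; in reality this would be my full list of input files.
val files = Seq("/data/file1", "/data/file2")

val rdd = sc.parallelize(files).flatMap { path =>
  // The Process is created *and* run here, on the worker, so only the
  // path string is captured in the closure shipped from the driver.
  val p: Process[Task, String] = io.linesR(path).map(_.toUpperCase) // toy transform
  p.runLog.run // IndexedSeq[String], flattened into the resulting RDD
}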
Does that make sense?