Thanks for the code sample. It is related to Nextflow in the sense that I am attempting to use Nextflow to execute Groovy code that must be serialized and sent to Spark. I will probably have more to say about how my stack integrates with Nextflow in the coming weeks; as of now I haven't gotten past the proof-of-concept phase and need to attend to a few upcoming deadlines. For now, though:
The core of my use case, as it relates to Nextflow, is coordinating analysis of our distributed graph store: the Titan distributed graph database backed by Cassandra and Elasticsearch. We have reached a point where it is relatively easy to develop a schema and integration strategy for additional datasets or database dumps, and through Apache TinkerPop we get rather nice general query capabilities via the Gremlin graph traversal language, plus OLAP capabilities via TinkerPop's spark-gremlin and hadoop-gremlin modules.

Another benefit of Gremlin is portability: if we ever need a different graph database backend, such as Neo4j (faster, with a more flexible data model for OLTP), Stardog (good for ontologies), or BlazeGraph (support for GPU-cluster graph computing, and we have a few GPUs), the TinkerPop APIs allow the same Gremlin queries to be executed against the new backend, though it is up to the vendor to implement the APIs and optimize query execution. Titan also offers a log processor that can be called upon to perform "house cleaning" jobs in response to commits that add or mutate data.

My thinking for Nextflow is that it integrates well with the Java/Groovy trend in all the aforementioned technologies, and that it can recover the ability to run arbitrary scripts or traditional workflows already implemented against our clusters' LSF/PBS schedulers, or that otherwise access data housed in our graph system, either as automated house-cleaning jobs dispatched by Titan or as new workflows. Further, new workflows that go beyond the capabilities of graph queries and call some executable, or a Perl/Python script or what have you, can probably reuse more code this way. Ignite is also quite interesting, both on its own and as an on-demand cluster to support the execution of Nextflow jobs.
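To make the "glue" idea concrete, here is a rough sketch of the kind of pipeline I have in mind: a Nextflow process that dispatches a Gremlin OLAP traversal (via SparkGraphComputer from spark-gremlin), followed by a process that reuses an existing script unchanged. The file names, properties file, traversal, and the legacy script are all placeholders for illustration, not working code from my setup:

```groovy
// Hypothetical Nextflow pipeline sketch -- all paths and names are assumptions.
// Assumes the Gremlin Console (gremlin.sh) and a TinkerPop HadoopGraph
// properties file configured for SparkGraphComputer are available on the path.

params.graphConfig = 'conf/hadoop-graph.properties'  // hypothetical config path

process gremlinOlap {
    output:
    file 'counts.txt' into counts_ch

    """
    cat > query.groovy <<'EOF'
    graph = GraphFactory.open('${params.graphConfig}')
    g = graph.traversal().withComputer(SparkGraphComputer)
    println g.V().hasLabel('gene').count().next()
    EOF
    gremlin.sh -e query.groovy > counts.txt
    """
}

process postProcess {
    input:
    file counts from counts_ch

    """
    # reuse an existing (hypothetical) legacy script unchanged
    perl legacy_summarise.pl ${counts}
    """
}
```

The point is simply that the OLAP step and the legacy-script step live in the same workflow, with Nextflow handling the staging and scheduling between them.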
I spent quite a while going through different "data flow" packages in Python and went through the GPars documentation, and everything seemed a little lackluster or onerous. Nextflow seems to have it all covered, so we see it as something to glue things together nicely without sacrificing capability in exchange for the added clarity and ease of use.
This has been my thinking on the subject, but given my level of experience I could well be in error.
Best,
Dylan