Thanks for the code sample. It is related to Nextflow in the sense that I am attempting to use Nextflow to execute Groovy code that must be serialized and sent to Spark. I will probably have more to say about how my stack integrates with Nextflow in the coming weeks; as of now I haven't gotten past the proof-of-concept phase and need to attend to a few upcoming deadlines. For now, though:
The core of my use case, as it relates to Nextflow, is coordinating analysis of our distributed graph store: the Titan distributed graph database backed by Cassandra and Elasticsearch. We have reached a point where it is relatively easy to develop a schema and integration strategy for additional datasets or database dumps, and through Apache TinkerPop we get rather nice general query capabilities via the Gremlin graph traversal language, plus OLAP capabilities via TinkerPop's spark-gremlin and hadoop-gremlin modules.

Another benefit of Gremlin is portability: if we ever need a different graph database backend, such as Neo4j (faster, with a more flexible data model for OLTP), Stardog (good for ontologies), or BlazeGraph (support for GPU-cluster graph computing, and we have a few GPUs), the TinkerPop APIs allow the same Gremlin queries to be executed against the new backend, though it is up to the vendor to implement the APIs and optimize query execution. Titan also offers a log processor that can be called upon to perform "house cleaning" jobs in response to commits that add or mutate data.

My thinking for Nextflow is that it integrates well with the Java/Groovy trend in all the aforementioned technologies, and that it can recover the ability to run arbitrary scripts or traditional workflows already implemented against our clusters' LSF/PBS schedulers, or that otherwise access data housed in our graph system, either as automated house-cleaning jobs dispatched by Titan or as new workflows. Further, new workflows that go beyond the capabilities of graph queries and call some executable, or a Perl/Python script or what have you, can probably reuse more code this way. Ignite is also quite interesting, both on its own and as an on-demand cluster to support the execution of Nextflow jobs.
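To make the "glue" idea concrete, here is a rough sketch of the kind of pipeline I have in mind: a Nextflow process that dispatches a Gremlin OLAP traversal (via SparkGraphComputer from spark-gremlin), followed by a process that reuses an existing script unchanged. The file names, properties file, traversal, and the legacy script are all placeholders for illustration, not working code from my setup:

```groovy
// Hypothetical Nextflow pipeline sketch -- all paths and names are assumptions.
// Assumes the Gremlin Console (gremlin.sh) and a TinkerPop HadoopGraph
// properties file configured for SparkGraphComputer are available on the path.

params.graphConfig = 'conf/hadoop-graph.properties'  // hypothetical config path

process gremlinOlap {
    output:
    file 'counts.txt' into counts_ch

    """
    cat > query.groovy <<'EOF'
    graph = GraphFactory.open('${params.graphConfig}')
    g = graph.traversal().withComputer(SparkGraphComputer)
    println g.V().hasLabel('gene').count().next()
    EOF
    gremlin.sh -e query.groovy > counts.txt
    """
}

process postProcess {
    input:
    file counts from counts_ch

    """
    # reuse an existing (hypothetical) legacy script unchanged
    perl legacy_summarise.pl ${counts}
    """
}
```

The point is simply that the OLAP step and the legacy-script step live in the same workflow, with Nextflow handling the staging and scheduling between them.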
I spent quite a while going through different "data flow" packages in Python and went through the GPars documentation, and everything seemed a little lackluster or onerous. Nextflow seems to have it all covered, so we see it as something to glue things together nicely without sacrificing capability in exchange for the added clarity and ease of use.
This has been my thinking on the subject, but given my level of experience I could well be in error.
Best,
Dylan