How to avoid monolithic pipeline files


SveinT

Nov 25, 2014, 7:16:25 AM
to next...@googlegroups.com
Hi,

As pipelines are developed, they tend to grow in size pretty quickly. I'll have different pipelines sharing some functionality/processes, but I haven't found a way to easily refactor these out into modules or similar.

How would one do modular design with Nextflow, making parts of a pipeline reusable between different pipelines? For example, I might have the same process for variant calling, but what comes after it will be very different between pipelines. Sharing the variant calling part would be handy. Combining all pipelines into one is not an option; the file would be huge and unmaintainable.

Any ideas?

Regards

Paolo Di Tommaso

Nov 25, 2014, 12:18:21 PM
to nextflow
Hi, 

Currently it is only possible to share helper functions, by putting them into a Groovy/Java class that is stored in a "lib" folder in your project root.

There's an idea to implement a module concept that would allow one to include one or more external pipeline modules into a pipeline script, but this feature has not yet been planned.



Cheers,
Paolo


 





Manuel Holtgrewe

Dec 9, 2014, 9:44:24 AM
to next...@googlegroups.com
Dear Paolo,

it is good to read that this item is on your list.

This feature is of importance to us since we need to have multiple NGS variant calling pipelines that each differ slightly and have a large overlap. I imagine that many other people also have the need for this.

Would it be possible to contribute something like this to Nextflow? Can you give an estimate of how much work that would be?

Thanks,
Manuel

Paolo Di Tommaso

Dec 10, 2014, 6:49:16 AM
to nextflow
Hi Manuel, 

Your contribution is more than welcome. However, the main problem is outlining a module/component mechanism that is coherent with the framework design; it is more a design question than an implementation issue.

Currently this is not yet very clear, so maybe it's time to start a discussion about this topic. In my opinion the main points to be taken into consideration are:

1) Component granularity, i.e. what a component defines (functions? processes? a flow interaction?)
2) Interface definition, i.e. how the component communicates with the outside world.
3) Shareability and discovery, i.e. how a component is shared between projects.


It's definitely not an easy task, and the first step is to understand the users' requirements. Could you (and anybody else who may be interested) please elaborate a bit more on your needs and how you would like to manage components in a Nextflow pipeline?


Also, it must be taken into consideration that, since Nextflow runs on the JVM and is an extension of the Groovy programming language, all the mechanisms provided by these environments are available.

For example, it's possible to create a function library by writing a Groovy or Java class and importing it in your script, as you would in any Groovy/Java source. You could create the following class:

class Library {

    static def foo() {
        return "echo foo"
    }

    static def bar() {
        return "echo bar"
    }
}
And save it as Library.groovy in a folder called "lib" in your pipeline project root. Nextflow will add that folder to your classpath and compile the class automatically, and you will be able to use it in your script as shown below:

import Library

process Foo {

  script:
  Library.foo()
}

This can be useful for building a library of helper methods and script wrappers that can be tested at the unit level with any of the many tools available in the Java/Groovy community.

 
But I think you are interested in a higher-level component mechanism that is able to "capture" the data flow between different processes. Let me know more about that.

Also, did you manage to solve the proxy problem you reported on GitHub?


Cheers,
Paolo

Manuel Holtgrewe

Dec 11, 2014, 8:27:33 AM
to next...@googlegroups.com
Hi Paolo,

I think your recommended way fits my needs best. I will create Groovy code that generates the shell commands in the lib folder and then glue it together with the Nextflow DSL.

And, yes, I could fix the proxy problem.

Cheers,
Manuel

Andrew Stewart

Dec 12, 2014, 4:13:59 PM
to next...@googlegroups.com
Hi Manuel,

Another option that already exists is to use git's branch structure to manage different versions/combinations of pipelines. Nextflow supports running a specific branch, e.g. "nextflow run nextflow-io/hello -r mybranch".

Christian Frech

Apr 29, 2015, 12:15:17 PM
to next...@googlegroups.com
Not being able to modularize pipeline scripts is currently holding me back from considering Nextflow for my own bioinformatics pipeline development. Without this feature, different but similar pipelines contain lots of duplicated code that eventually becomes unmaintainable. Modularization would also allow sharing pipeline components between users and could trigger the development of domain-specific Nextflow libraries within the community.

I currently use Anduril, which has both a component mechanism and a component library in place that could serve as examples. But I have to admit that I really like the slickness of Nextflow code and its seamless Docker integration!

What is the current status of this feature? As a start, it would already help if one could simply source in Nextflow files with an include directive. Proper namespaces then become an issue, though, to avoid name conflicts between the input and output channels of different processes. Maybe one could just prefix input and output channels with process names?

Paolo Di Tommaso

Apr 30, 2015, 6:53:28 AM
to nextflow
Hi, 

We are currently testing a templating mechanism that allows process scripts to be externalised into files separate from the main pipeline script. This will make it possible to better decouple the pipeline logic from task-specific implementation details, and to achieve some degree of script reuse.
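To make this concrete, here is a sketch of how such a templating mechanism might look (the file name "sayhello.sh", the channel, and the variable "name" are all illustrative, not a confirmed API):

```groovy
// templates/sayhello.sh -- a separate file next to the pipeline script;
// variables such as ${name} would be resolved from the process scope:
//
//   #!/bin/bash
//   echo "Hello, ${name}!"

// main.nf
names = Channel.from('world', 'nextflow')

process sayHello {
    input:
    val name from names

    script:
    template 'sayhello.sh'
}
```

The process body then contains only a reference to the external script file, so the same template could be reused by several pipelines.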

Regarding the include directive, it is still a pending feature. The trickiest part is finding a coherent way to define the input/output of a sub-flow in the caller process.

Also, I'm not even sure that an include directive would be the "correct" level of componentization and code reuse to aim for with Nextflow.

Nextflow has been designed to integrate and reuse any piece of software in a single pipeline script. In this model the reusable software component is the process script, i.e. an external tool/script that can eventually be packaged into a Docker container.

A process can even execute native JVM code. In this case a component can be defined and shared using the mechanisms provided by the underlying Java/Groovy ecosystem (classes, jar libraries, Maven repositories, etc.).
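For illustration, here is a minimal sketch of a process running native Groovy code via an "exec:" block (the channel and variable names are mine, not from the thread):

```groovy
nums = Channel.from(1, 2, 3)

process square {
    input:
    val x from nums

    output:
    val y into squared

    // 'exec:' runs plain Groovy/JVM code instead of a shell command
    exec:
    y = x * x
}

squared.subscribe { println it }
```

Because the body is ordinary JVM code, the squaring logic could just as well live in a class under lib/ and be shared across pipelines, as in the earlier Library example.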


However, this is a controversial point and still an open discussion, so I'm very happy to receive ideas and contributions regarding it.


Cheers,
Paolo
 

Davide Rambaldi

May 6, 2015, 3:51:24 AM
to next...@googlegroups.com
In bpipe:

- external stages can be loaded at runtime with the keyword load

- bpipe automatically loads/compiles files in BPIPE_LIB (a heavy load)

Couldn't something similar be implemented in Nextflow?

Davide Rambaldi

Paolo Di Tommaso

May 7, 2015, 5:18:51 AM
to nextflow
I would love to review a pull request for that. 

Cheers,
Paolo


