Basic Questions about NextFlow


Ted Toal

Jul 27, 2017, 5:13:27 PM
to Nextflow
I've been reviewing several pipeline frameworks to find one that fits our needs.  Many things about NextFlow look very good, but after going through the entire documentation, I find I'm unclear about two areas:

1. Nextflow creates directories for execution of processes, and by default results are stored in these directories.  One can link, move, or copy them elsewhere with publishDir, but anything besides a link would have problems.  Having results in the NF-created directories would be awful, very hard to find things.  I can understand that architecture-wise this might have been done to help with reproducibility.  But, I consider the results awful - how can one easily navigate through perhaps thousands of files in a large project, if they are stored in these obscure directory names?

Supposing one uses publishDir with links, you could then make a directory structure that was navigable, but then, how do you keep the NF-created directories cleaned out of unneeded older runs?  This seems an impossible job.

And suppose you wanted to keep data for several runs.  There doesn't seem to be any mechanism by which to say, "I'd like to see the results from my April 1 run, NF, can you please place links in my folder tree to the files for that run that are located throughout your auto-created directories? (the way the links would have been created with publishDir at the time the run was done)"

Wouldn't it be fairly easy to provide a command that would set the working directory for a process to a user-chosen directory, instead of using the auto-created NF directory?  Not everyone wants to fit themselves into the NF directory scheme, and by forcing that on people, you are severely limiting interest in NF.


2. I want to be able to create MODULES that perform small tasks that are assembled together into a pipeline.  NF seems to almost allow this.  I'm not sure if there is an include statement that would let one include small module files into the main .nf file, but I imagine there is?

Suppose I make a module that I want to reuse multiple times in different places in the pipeline.  How would I do that?  The template mechanism almost seems to be for that purpose, except that it means that a process needs to be created for each different use of the template.  The process itself could be fairly involved, I would think, so it would be nice to re-use it also.  I had thought there would be a way to connect a channel to a process, but this seems to be hard-wired in the sense that the process must contain the name of the channel it uses.  Say that I want a module that indexes a .bam file.  I want to invoke it repeatedly in different places in the pipeline.  How do I make a module like that?

What I would like to be able to do is have a process be like a subroutine or template, where I could instantiate the process multiple times, each time providing arguments that would be used inside the process template to do things such as select the input and output channels.  Is something like that possible?

Phil Ewels

Jul 28, 2017, 5:26:27 AM
to next...@googlegroups.com
Hi Ted,

Perhaps Paolo will give a more detailed response, but here is my take on your points:

1. Individual work directories are core to how NextFlow works and come with several benefits - you mention reproducibility, but also it makes debugging easier, facilitates simple integration into containers and allows separation of workflow steps to avoid unintended consequences.

We always use a variable name for publishDir (`params.outdir`) and set that differently for each run with `--outdir`, so that results from different runs are kept separately. You can also specify where the work directories should be kept with `-w`. You can use the command `nextflow clean` to help with the removal of pipeline results from specific runs (use in combination with `nextflow log` to see run names).


2. Modules / sub-workflows are a much-requested feature for NF that are not currently supported but are being worked on (as I understand it, may be wrong). However, it should be possible to run a process several times with multiple different inputs. I'll leave it to Paolo to give more input on this as I haven't written it myself so don't have any particular recommendations.


I hope this helps! And welcome (potentially) to the Nextflow community.

Phil
--
You received this message because you are subscribed to the Google Groups "Nextflow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nextflow+u...@googlegroups.com.
Visit this group at https://groups.google.com/group/nextflow.
For more options, visit https://groups.google.com/d/optout.

Paolo Di Tommaso

Jul 28, 2017, 10:00:37 AM
to nextflow
Hi all, 

I'm adding a few notes to Phil's answer, quoting the original question.

> But, I consider the results awful - how can one easily navigate through perhaps thousands of files in a large project, if they are stored in these obscure directory names?

NF executes each task in its own unique work directory, for the many reasons already mentioned by Phil. It should also be noted that this mechanism makes it possible to parallelise tasks in a safe, lock-free manner and to resume pipeline execution seamlessly and consistently, i.e. without retaining partial outputs after an unexpected error.

Apart from that, there are many ways to locate the directory where a task was executed, e.g. the NF stdout, which prints the task name and ID (the ID matches the directory prefix), the `nextflow log` command, and the execution report.
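
To illustrate the "ID matches the directory prefix" point, here is a minimal Python sketch that looks up a task's work directory from the ID printed on the console. It is not part of NF. The function name `find_task_dirs` is made up for this example, and the layout `<work dir>/<first 2 characters>/<rest of the hash>` as well as the sample ID `3f/9a1b2c` are assumptions about a typical run.

```python
from pathlib import Path


def find_task_dirs(work_dir, short_hash):
    """Return task directories under work_dir that match a short task ID.

    short_hash is the ID shown in brackets on the console, e.g. "[3f/9a1b2c]".
    Assumed layout (not checked against NF itself):
    <work_dir>/<2-char bucket>/<remaining hash characters>.
    """
    # Split "3f/9a1b2c" into the bucket directory and the prefix of the task directory.
    bucket, _, prefix = short_hash.strip("[]").partition("/")
    bucket_dir = Path(work_dir) / bucket
    if not bucket_dir.is_dir():
        return []
    # The console shows only a prefix of the hash, so match by prefix.
    return sorted(
        p for p in bucket_dir.iterdir()
        if p.is_dir() and p.name.startswith(prefix)
    )
```

For example, `find_task_dirs("work", "[3f/9a1b2c]")` would return the directories under `work/3f/` whose names start with `9a1b2c`, which is normally a single directory.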

That said, I think there's an important point to realise: the NF work directory is not meant to hold the pipeline outputs but the pipeline's intermediate results, which can/should be removed once your workflow has been successfully executed.

Pipeline outputs are meant to be managed (and structured) by using the publishDir definition. Regarding symlinks, what we do is consolidate the pipeline outputs once the pipeline has executed successfully, replacing each symlink with the original file using a trivial bash script (there are a lot of examples to be found by googling).
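
The consolidation step described above could look roughly like the following. This is a Python sketch rather than the bash script mentioned, and it is not the script actually used. The function name `consolidate_symlinks` and the `move` flag are invented for illustration. By default it copies, which leaves the work directory untouched at the cost of extra storage; `move=True` saves the space but empties the work directory of those files.

```python
import os
import shutil
from pathlib import Path


def consolidate_symlinks(publish_dir, move=False):
    """Replace each symlink to a regular file under publish_dir with the real file.

    Returns the list of paths that were replaced. Symlinked directories and
    dangling links are left untouched in this sketch.
    """
    replaced = []
    # os.walk does not follow directory symlinks by default.
    for root, _dirs, files in os.walk(publish_dir):
        for name in files:
            link = Path(root) / name
            if not link.is_symlink():
                continue
            target = link.resolve()
            if not target.is_file():
                continue  # dangling link or not a regular file
            # Stage the real file next to the link, then swap it in,
            # so the published path never disappears halfway through.
            tmp = link.with_name(link.name + ".tmp-consolidate")
            if move:
                shutil.move(str(target), str(tmp))
            else:
                shutil.copy2(target, tmp)
            os.replace(tmp, link)
            replaced.append(link)
    return replaced
```

Calling `consolidate_symlinks("results")` after a successful run would turn the links under a hypothetical `results` directory into real files.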


> There doesn't seem to be any mechanism by which to say, "I'd like to see the results from my April 1 run, [..]

Maybe not exactly what you are asking for, but you may want to take a look at the `nextflow log` and `nextflow log <run name>` commands.


> Wouldn't it be fairly easy to provide a command that would set the working directory for a process to a user-chosen directory, instead of using the auto-created NF directory?  Not everyone wants to fit themselves into the NF directory scheme, and by forcing that on people, you are severely limiting interest in NF.

Again, NF is not general-purpose software. It's a framework designed for a very specific use case, i.e. to enable seamless scalability and parallelisation of existing tools and scripts. It's based on a functional/reactive model that is uncommon in other frameworks, and it may take a bit of time to get used to it and to see its benefits.

That said, the storeDir directive allows users to store data in a directory of their choice, but it comes with some caveats and should not be considered an alternative to the NF default directory structure mechanism.


> I want to be able to create MODULES that perform small tasks that are assembled together into a pipeline.  NF seems to almost allow this.  I'm not sure if there is an include statement that would let one include small module files into the main .nf file, but I imagine there is?

NF allows you to compose one or more commands by using template files, as you correctly point out. A template is a simple text file that can contain NF variables, which allow you to parametrise it, and it can be used in different processes. However, you will still need to declare the process definition in your script.

Including one NF script in another NF script is not supported at this time, but it's an open effort. However, it's still possible to invoke a NF pipeline from a NF process like any other command.


Hope it helps 


Cheers,
Paolo
 



Ted Toal

Jul 28, 2017, 2:04:21 PM
to Nextflow
Thanks for detailed responses, both of you!

Could you say more about what you mean by "simple integration into containers"?

I'd missed the --outdir option and the "clean" command.  Those together with publishDir using links might allow me to use NF the way I want.

It seems to me that the advantage of separating workflow steps to avoid unintended consequences doesn't preclude moving the results, AFTER the steps are executed, to a different, perhaps non-separated directory, where they would be retrieved by the following steps that depend on those files.  Accidental overwriting of files between steps could be detected when that move is performed.  I still feel that the advantages of your separate directory structure are being way over-emphasized and the problems with it under-emphasized.

I recently read (BROAD inst.??) that it is recommended that the files for each PERSON in a bioinformatic analysis of human sequence data be kept together in one folder, separate from those of each other person.  It makes sense and provides for a nice organization, which I've done in my current project.  Totally impossible to do in NF (but you CAN organize LINKS to the files in that manner).

You make a strong distinction between intermediate and final results, but I don't really consider many of the intermediate results discardable or useless after the fact.  I often find myself going back to look at intermediate files.  For example, after calling variants, one might say that the variants file is the final result and .bam mapping files are intermediate results no longer needed, but that isn't true, the .bam files are still valuable and often used.  I don't feel comfortable deleting ANY intermediate results.  If they are that intermediate, they probably shouldn't even exist - should be flowing through a pipe in a piped command and never appear as a file.

Using NF stdout and the log command does not really solve the problem of navigating results.  People navigate file systems all the time, looking for certain files.  If they are scattered like NF does it, no one is going to want to refer to a log file every time they want to find one file.  Use of publishDir is absolutely essential with NF, for almost all of the pipeline output files.  And use of --outdir with every run is almost essential.  Those facts are underemphasized in your user manual.

I think you should have, as almost the first thing in the documentation, a thorough description of the manner in which directories are created and organized by NF, the reasons for it, and its pros and cons, and a recommended usage method involving publishDir with links and --outdir to create a structured output folder tree for each complete pipeline run.  That stuff is a foundation upon which NF is built, and the user needs to have an understanding of it from the start.


> Regarding the point of symlinks, what we do is to consolidate the pipeline outputs once the pipeline is successfully executed, replacing each symlink with the original file with a trivial bash script (there are a lot of example googling).

I think you mean that after running the pipeline, you would move files to overwrite their links in the publishDir.  To me this is unsatisfactory.  It means that you are saving a snapshot of that one pipeline run (that's fine) but are now unable to use the intermediate results (which you've moved) of that run in a new run, so a new pipeline run means running the entire thing again.  The alternative is to have publishDir do copy instead of symlink, but then you're talking lots of storage.  I think having symlinks in the publishDir structured folder tree is a reasonable compromise and the best way to do it.


> Again, NF is not a general purpose software.

I don't think you should be limiting your concept of what this is.  There is a huge need for a good flexible workflow management system.  There are at least 36 such systems out there (I've been trying to review the better of these).  All have some great "pros" to them, but even the best always seem to have a strong "con".  I think it is possible to have a single system that is so good that it eliminates the competition, and NF might be in the running to do that.  Right now the biggest problem I see with NF is lack of good support for modularization.  You've more or less convinced me that the odd directory structuring is not a problem provided publishDir is used with --outdir and the "clean" command is used to remove unwanted runs.



Paolo Di Tommaso

Jul 31, 2017, 10:41:02 AM
to nextflow
Hi,

quoting as before: 

> Could you say more about what you mean by "simple integration into containers"?

NF has built-in support for Docker and Singularity container technology. This allows you to isolate the pipeline dependencies in one or more container images, which simplifies deployment and enables reproducibility of your pipeline.

> I recently read (BROAD inst.??) that it is recommended that the files for each PERSON in a bioinformatic analysis of human sequence data be kept together in one folder, separate from those of each other person.  It makes sense and provides for a nice organization, which I've done in my current project.  Totally impossible to do in NF (but you CAN organize LINKS to the files in that manner).

publishDir allows you to organise any data in your pipeline wherever you want. There is no restriction on the directory structure you can create.

> I don't feel comfortable deleting ANY intermediate results.  If they are that intermediate, they probably shouldn't even exist - should be flowing through a pipe in a piped command and never appear as a file.

Put a rule `output: file '*'` in your processes to track all the outputs as you need.  

> I think having symlinks in the publishDir structured folder tree is a reasonable compromise and the best way to do it.

Good.

> Right now the biggest problem I see with NF is lack of good support for modularization

We are planning to implement a proper modularization mechanism in NF.



Out of curiosity, how did you find Nextflow and which other frameworks have you tested?


Best,
Paolo




Ted Toal

Jul 31, 2017, 2:01:59 PM
to Nextflow
Good, thanks for the answers.

I don't remember where I first ran across NextFlow, but it is #2 in my list of some 30+ workflow management systems, so I must have come across it early.

The main frameworks I've been looking at are:

SnakeMake
NextFlow
Bpipe
BigDataScript
ClusterFlow
Cromwell

At this point, I'm leaning towards using SnakeMake.  Some of my earlier concerns about it have gone now that I understand it better.  Clusterflow is very nice but seems to be too simple to permit enough flexibility.

I've created a mini-test-project to use for testing prospective workflow frameworks, and I'm trying to run it using SnakeMake as my first test.  One of my key requirements is modularity.  When NF implements some features to support this, it will be much more in the running for me.

I also had these on my list, but dismissed most of them fairly quickly, many because they were too new or too amateurish or poorly supported, had very few GitHub commits, or were focussed on a particular application domain.  I should probably revisit a couple of these:

Ruffus
Leaf
PaPy
Qiime
SNAPR
TREVA
GenePattern
Taverna
UGENE WorkFlow
VisTrails
Anduril
Biomake
Bioqueue
Briefly
ClusterJob
Conan2
Consecution
COSMOS
Dagobah
Dask
Xp
Flowr
eHive
Kronos
Loom
Luigi
Mistral
Moa
OpenGe
Toil

Ted


Paolo Di Tommaso

Jul 31, 2017, 3:24:16 PM
to nextflow
It sounds like good material for a paper. Are you planning to publish a review or something like that?

A suggestion for your comparison: don't forget to evaluate the scalability of these tools.



Cheers,
Paolo
 

Ted Toal

Jul 31, 2017, 3:55:50 PM
to Nextflow
Oh, no, I'm not, shouldn't have given that impression.  It would be a good thing to review, but much more work than the limited review I'm doing. This is just for my lab.  There are so many factors to evaluate in order to do a thorough review!  It would be quite a job!