I've been reviewing several pipeline frameworks to find one that fits our needs. Many things about Nextflow look very good, but after going through the entire documentation, I'm still unclear about two areas:
1. Nextflow creates directories for the execution of processes, and by default results are stored in those directories. One can link, move, or copy them elsewhere with publishDir, but anything other than a link seems problematic (a copy duplicates the data, and a move takes the file out of the place where NF put it). Leaving results in the NF-created directories would make things very hard to find. I can understand that, architecture-wise, this might have been done to help with reproducibility. But how can one easily navigate through perhaps thousands of files in a large project if they are stored under these obscure directory names?
Suppose one uses publishDir with links. One could then build a directory structure that is navigable, but how does one then keep the NF-created directories cleaned of older runs that are no longer needed? This seems an impossible job.
And suppose you wanted to keep data for several runs. There doesn't seem to be any mechanism by which to say, "NF, I'd like to see the results from my April 1 run. Can you please place links in my folder tree to the files for that run, which are scattered throughout your auto-created directories?" (that is, recreate the links the way publishDir would have created them at the time the run was done).
Wouldn't it be fairly easy to provide a command that would set the working directory for a process to a user-chosen directory, instead of using the auto-created NF directory? Not everyone wants to fit themselves into the NF directory scheme, and by forcing that on people, you are severely limiting interest in NF.
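To make the per-run relinking request in point 1 concrete, here is a rough sketch in Python of the behaviour I have in mind. It is not based on anything I found in the NF documentation: the `relink_run` function and the idea of a per-run manifest (a mapping from a path in my own results tree to the file inside an NF-created directory) are both hypothetical, and I am only assuming that such a mapping could be recorded when a run finishes.

```python
from pathlib import Path
from typing import Dict, List


def relink_run(manifest: Dict[str, str], target_root: str) -> List[Path]:
    """Recreate symlinks for one run inside a user-chosen results tree.

    `manifest` is a hypothetical per-run record: each key is a path relative
    to `target_root`, each value is the file inside the auto-created work
    directory that the link should point to.
    """
    root = Path(target_root)
    created = []
    for rel_path, work_file in manifest.items():
        link = root / rel_path
        # Build the navigable folder structure on demand.
        link.parent.mkdir(parents=True, exist_ok=True)
        # Replace a stale link left over from showing an earlier run.
        # A regular file at this path is not touched; symlink_to will raise.
        if link.is_symlink():
            link.unlink()
        link.symlink_to(work_file)
        created.append(link)
    return created
```

Calling `relink_run(april_1_manifest, "results")` would then populate `results/` with links for that one run, and calling it again with another run's manifest would repoint the same links, where `april_1_manifest` is again just an illustrative name.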
2. I want to be able to create MODULES that perform small tasks that are assembled together into a pipeline. NF seems to almost allow this. I'm not sure if there is an include statement that would let one include small module files into the main .nf file, but I imagine there is?
Suppose I make a module that I want to reuse multiple times in different places in the pipeline. How would I do that? The template mechanism almost seems to be for that purpose, except that a separate process still needs to be created for each use of the template. The process itself could be fairly involved, I would think, so it would be nice to reuse it as well. I had thought there would be a way to connect a channel to a process from the outside, but the connection seems to be hard-wired, in the sense that the process must contain the name of the channel it uses. Say that I want a module that indexes a .bam file, and I want to invoke it repeatedly in different places in the pipeline. How do I make a module like that?
What I would like to be able to do is have a process be like a subroutine or template, where I could instantiate the process multiple times, each time providing arguments that would be used inside the process template to do things such as select the input and output channels. Is something like that possible?
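Since I'm not sure I've described this clearly, here is the idea sketched in plain Python rather than in NF syntax. Everything here is illustrative: `make_process` is a hypothetical helper of my own, plain lists stand in for channels, and the sketch only builds command strings instead of running anything. The point is just that the process is defined once and the inputs are supplied at each place it is used.

```python
from typing import Callable, Iterable, Iterator


def make_process(command_template: str) -> Callable[[Iterable[str]], Iterator[str]]:
    """Define a reusable 'process' once; the input is bound at each call site."""

    def process(inputs: Iterable[str]) -> Iterator[str]:
        # One command per item arriving on the input "channel".
        for item in inputs:
            yield command_template.format(input=item)

    return process


# Defined once, like a subroutine.
index_bam = make_process("samtools index {input}")

# Used in two different places in the pipeline, each with its own input.
after_alignment = list(index_bam(["sample1.bam", "sample2.bam"]))
after_dedup = list(index_bam(["sample1.dedup.bam"]))
```

What I'm asking is whether NF offers an equivalent: one process definition whose input and output channels are chosen where it is invoked, not inside the definition.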