Running only part of a Nextflow pipeline


Stephan Schmeing

Dec 21, 2016, 11:27:42 AM
to Nextflow
Hi,

is it possible to run only part of a Nextflow pipeline, like giving a specific target in make?

For example you have three processes:
countKmers
histKmers
dumpKmers

The output from countKmers is the input for histKmers and dumpKmers.

I would now like to specify that I want histKmers to be run, so that countKmers runs first and then histKmers, but dumpKmers does not run.

So is that somehow possible?

Thanks,
Stephan

Paolo Di Tommaso

Dec 21, 2016, 12:34:37 PM
to nextflow
Hi Stephan, 

Not out of the box. However, it's possible to implement a custom condition for process execution by using a when directive. You could use it with a parameter to enable/disable the execution of a certain process (and, implicitly, all the ones that depend on it).
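
A minimal sketch of this approach (DSL1 style; the boolean parameter params.with_dump and the kmer_counts channel are illustrative names, not part of your pipeline):

// illustrative parameter to switch the dump step on or off
params.with_dump = false

process dumpKmers {
    input:
        file counts from kmer_counts

    output:
        file 'kmer_dump.txt' into kmer_dump

    when:
        params.with_dump

    """
    # replace with the actual dump command
    my_dump_tool $counts > kmer_dump.txt
    """
}

Launched as `nextflow run main.nf --with_dump`, the process runs; without the flag it is skipped, and anything reading from kmer_dump is implicitly skipped as well, because its input channel never emits.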


Hope it helps.

Cheers,
Paolo



Stephan Schmeing

Dec 22, 2016, 6:35:15 AM
to Nextflow
Thanks for the fast answer. I hope doing it manually with when is not too much overhead as the project grows.

A follow up question:
I have 2 processes:
downloadData
gzipData

I use a storeDir for gzipData, but when I remove the work directory and restart, it still runs the downloadData process. Is there a way to make the downloadData process depend on the gzipData process, so that it only runs when the gzipData process needs to be run? Of course I can use when to check manually whether the files created by gzipData exist, but again that's awkward overhead.

Paolo Di Tommaso

Dec 22, 2016, 7:33:01 AM
to nextflow
Yes, you can manage that with the storeDir directive. 


Cheers,
Paolo


Stephan Schmeing

Dec 22, 2016, 7:51:30 AM
to Nextflow
And how?

I have it like this:
/*
 * Download the data
 */
process downloadData{
    output:
        file "${params.data}_{1,2}.fastq" into raw_data mode flatten
       
    """
    fastq-dump --split-files ${params.data}
    """
}

/*
 * Gzip the data
 */
process gzipData{
    storeDir 'data/00_raw'

    input:
        file fqFile from raw_data
   
    output:
        file "${params.data}_{1,2}.fq.gz" into raw_data_gz
       
    """
    gzip -c -9 $fqFile > ${fqFile.toString().replaceFirst('fastq', 'fq')}.gz
    """
}

I cannot find anything that stops downloadData from being executed, except a manual when that checks whether ${params.data}_{1,2}.fq.gz already exists in 'data/00_raw'.

Paolo Di Tommaso

Dec 22, 2016, 8:12:01 AM
to nextflow
My apologies, I misread your previous question. 

You should use storeDir in `downloadData`, not in `gzipData`. In this way, if the data is already available in the directory specified as `storeDir`, the process is skipped, and as a consequence `gzipData` won't be executed either.

A couple of comments on your code snippet: 
  • You can simplify the output file declaration as:  file "*_{1,2}.fq.gz"
  • Also, the replacement can be written as:  fqFile.name.replaceFirst('fastq', 'fq')
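
A minimal sketch of that suggestion applied to the snippet above (the 'data/00_sra' path is just an example):

/*
 * Download the data; storeDir makes Nextflow skip the task
 * when the FASTQ files are already present in that directory
 */
process downloadData {
    storeDir 'data/00_sra'

    output:
        file "${params.data}_{1,2}.fastq" into raw_data mode flatten

    """
    fastq-dump --split-files ${params.data}
    """
}

/*
 * Gzip the data
 */
process gzipData {
    input:
        file fqFile from raw_data

    output:
        file "*_{1,2}.fq.gz" into raw_data_gz

    """
    gzip -c -9 $fqFile > ${fqFile.name.replaceFirst('fastq', 'fq')}.gz
    """
}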

Cheers,
p




Stephan Schmeing

Dec 22, 2016, 9:17:08 AM
to Nextflow
Thanks for the comments. I will simplify the code, but if I used storeDir on downloadData it would store the uncompressed data, which on the one hand is not what I want, and on the other hand means that restarting the pipeline after removing the work directory would rerun gzipData.


The idea is to have certain waypoints in my long pipeline where I permanently store the data, so that I can then remove the huge work directory to save disk space. If I later return to that pipeline and want to add something, I want to start from the closest waypoint and only run the part of the pipeline that is new, not all the intermediate steps I deleted because I no longer need them. storeDir handles the creation of the waypoints perfectly, but I cannot find anything to control the workflow so that it starts at the waypoint instead of repeating all the intermediate steps before it.

Paolo Di Tommaso

Dec 22, 2016, 9:27:25 AM
to nextflow
In this case I would suggest merging `downloadData` and `gzipData` into a single process. Since they are sequential tasks there isn't any benefit in using two separate processes, and in this way you could store the compressed data.



p




Stephan Schmeing

Dec 22, 2016, 10:08:31 AM
to Nextflow
As I have two data files, I'd like to execute gzip in parallel, which does not happen when I do it sequentially in a single process.



Paolo Di Tommaso

Dec 23, 2016, 8:46:00 AM
to nextflow
Good point. Unfortunately there's no magic trick to handle this with NF. 

I would in any case merge `downloadData` and `gzipData` into a single process, and compress the files in parallel with GNU parallel.
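
A minimal sketch of such a merged process, assuming GNU parallel is available in the task environment and that fastq-dump produces ${params.data}_1.fastq and ${params.data}_2.fastq (the process name downloadAndGzip is illustrative):

/*
 * Download and compress in one process; only the compressed
 * files are kept in storeDir
 */
process downloadAndGzip {
    storeDir 'data/00_raw'
    cpus 2

    output:
        file "${params.data}_{1,2}.fq.gz" into raw_data_gz mode flatten

    """
    fastq-dump --split-files ${params.data}
    # compress both FASTQ files concurrently; {.} is the input name without extension
    ls ${params.data}_*.fastq | parallel -j ${task.cpus} 'gzip -c -9 {} > {.}.fq.gz'
    """
}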


p

Kemin Zhou

May 21, 2022, 12:22:41 PM
to Nextflow

I am also having similar needs. My pipeline is very long. Sometimes we need to reprocess data for the later steps, but usually we have changed the Nextflow script somewhere before that point, or we might have removed data before that point, so resuming the pipeline would start from the beginning. One possible solution is to write a shell script, guided by the log file, that picks the run tree of the pipeline from a certain point, goes to the work directories of those processes in the order of the tree's postorder traversal, and reissues .command.sh. In such cases we change the underlying binary executed by .command.sh. I'm not sure this is the right way of thinking about it (is this a reasonable design?).
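
For what it's worth, a hypothetical shell sketch of that idea, assuming the task hashes and work directories are taken from nextflow log (the run name and directory path are placeholders):

# list the tasks of a previous run together with their work directories
nextflow log <run name> -f hash,name,status,workdir

# for each task to redo, re-execute it in place;
# .command.run stages the inputs and then calls .command.sh
cd work/ab/cdef1234567890
bash .command.run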