DSL2: glob path, multiple input paths


Adam Streck

Jan 19, 2022, 6:11:04 AM
to Nextflow
Hi,

I'm trying to create a process that runs over multiple files matched by a glob pattern while also using additional files. As an example, consider:

process haplotypeCaller {
    input:
    path bam
    path ref
   
    output:
    path "${bam.baseName}.vcf.gz"

    """
    gatk HaplotypeCaller -I $bam -R $ref -O ${bam.baseName}.vcf.gz
    """
}
I would like to call it like this:
haplotypeCaller(Channel.fromPath("*.bam"), Channel.fromPath("hg38.fa"))
The problem is that even if I have multiple BAM files, only the first one gets used, because I provide only one reference file. It seems that without DSL2 I could solve this using "each ... from" to repeat the process. However, DSL2 does this automatically, and I don't know how to tell it to use the one reference file for every input file.

Anand

Jan 19, 2022, 3:59:26 PM
to next...@googlegroups.com


Harshil Patel

Jan 19, 2022, 4:15:19 PM
to next...@googlegroups.com

Hi Adam,

Can you try:

haplotypeCaller(Channel.fromPath("*.bam"), file("hg38.fa"))



Adam Streck

Jan 19, 2022, 6:09:33 PM
to Nextflow
Thank you!

Using 

haplotypeCaller(Channel.fromPath("*.bam"), file("hg38.fa"))
does exactly what I was looking for.

harshi...@seqera.io

Jan 20, 2022, 4:48:35 AM
to next...@googlegroups.com

 

Awesome! In general, you should only use "Channel.fromPath" if you need to evaluate a glob pattern for files; otherwise you should use "file".
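
To spell out why this works: with two queue channels, a process takes one item from each channel per task, so a one-item reference channel is used up after the first BAM. A single value such as the result of file() is treated as a value channel, which can be read any number of times. The following is only a minimal sketch of the whole script under that reading; the process is the one from the original question, and the commented-out alternatives are other ways I would expect to give the same behaviour, not something stated in this thread.

```nextflow
nextflow.enable.dsl = 2

process haplotypeCaller {
    input:
    path bam
    path ref

    output:
    path "${bam.baseName}.vcf.gz"

    script:
    """
    gatk HaplotypeCaller -I $bam -R $ref -O ${bam.baseName}.vcf.gz
    """
}

workflow {
    // Queue channel: emits one item per file matching the glob
    bams = Channel.fromPath("*.bam")

    // Single path value: reused for every item in `bams`
    ref = file("hg38.fa")

    // Alternatives that should also yield a reusable value channel:
    // ref = Channel.value(file("hg38.fa"))
    // ref = Channel.fromPath("hg38.fa").first()

    haplotypeCaller(bams, ref)
    haplotypeCaller.out.view()
}
```

With this layout one task is started per BAM file, and each task receives the same reference.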

Adam Streck

Jan 20, 2022, 5:12:25 AM
to Nextflow
Understood, thank you. I have now run into a follow-up issue; maybe you know how to resolve that as well?

I noticed that if I use the glob pattern and a process is executed multiple times, the order of execution is non-deterministic. E.g. I may get [file1.vcf, file2.vcf] on the output, or [file2.vcf, file1.vcf].

The issue is that I have multiple parallel processes and some processes that take inputs from multiple channels produced by previous processes. I need to make sure that the output files are always in the same order.

I noticed that Nextflow's output actually shows which file was processed last, so sometimes I get, e.g.,
[e6/ec7698] process > txtFiles (2) [100%] 2 of 2
or on another execution:
[60/bc05a6] process > txtFiles (1) [100%] 2 of 2

meaning that the execution order was [file1, file2] and [file2, file1] respectively. Is there any way to force it to always be [file1, file2]?

drhp...@gmail.com

Jan 20, 2022, 9:27:36 AM
to Nextflow

No worries! Nextflow is asynchronous by nature, which is why a process runs as soon as its input data is ready.
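
One consequence is that outputs of two parallel processes cannot be matched up by their position in the channel. If the goal is to feed matching outputs into a later process, a common approach is to carry a key with every file and pair on that key with the join operator. The sketch below assumes that reading of the problem; the process names callVariants and countReads and their touch commands are hypothetical placeholders, not taken from this thread.

```nextflow
nextflow.enable.dsl = 2

// Hypothetical placeholder process: emits one file per sample, tagged with its id
process callVariants {
    input:
    tuple val(id), path(bam)

    output:
    tuple val(id), path("${id}.vcf")

    script:
    """
    touch ${id}.vcf
    """
}

// Second hypothetical process running in parallel on the same inputs
process countReads {
    input:
    tuple val(id), path(bam)

    output:
    tuple val(id), path("${id}.count.txt")

    script:
    """
    touch ${id}.count.txt
    """
}

workflow {
    // Attach the base name as a key to every BAM file
    bams = Channel.fromPath("*.bam").map { bam -> tuple(bam.baseName, bam) }

    callVariants(bams)
    countReads(bams)

    // Pair outputs by key rather than by arrival order: [id, vcf, counts]
    callVariants.out.join(countReads.out).view()
}
```

Because the pairing is done on the key, it does not matter which task finishes first.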

If I understood you correctly, you should enforce the sort order of the files in a channel within the process itself. This way it doesn't affect other parts of the pipeline. See picard/mergesamfiles for an example. This also means that the process will be cached correctly when using "-resume", because the order will always be the same.
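
As a rough illustration of sorting inside the process, the sketch below collects all files into one task and sorts them by name before building the command line. The process name mergeVcfs and the bcftools command are assumptions chosen for illustration only, and the sketch assumes more than one file is collected, so that the input arrives as a list.

```nextflow
nextflow.enable.dsl = 2

// Hypothetical merge step; the name and the command are illustrative assumptions
process mergeVcfs {
    input:
    path vcfs

    output:
    path "merged.vcf.gz"

    script:
    // Sort by file name so the generated command is the same on every run,
    // whatever order the upstream tasks finished in
    def inputs = vcfs.collect { it.name }.sort().join(' ')
    """
    bcftools concat -O z -o merged.vcf.gz $inputs
    """
}

workflow {
    vcfs = Channel.fromPath("*.vcf.gz")

    // collect() gathers every item into a single list, so one task sees all files
    mergeVcfs(vcfs.collect())
}
```

Sorting could also be done on the channel side with the toSortedList operator, but keeping it in the process means every caller gets the same order without extra work.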