How to iterate over parameters created by Channel.combine


olga.bo...@czbiohub.org

Mar 16, 2019, 10:29:48 PM
to Nextflow
Hello,
I've had success setting up Nextflow for my pipeline (yay!), and I'd like to extend it to test multiple parameters at once. Right now I'm making a bunch of separate makefiles and copying the script around to rerun it with different parameters, but I'd really like to be able to specify all the parameter combinations at once as a Cartesian product. I found the `combine` operator, which does this for channels:

ksizes = Channel.from([15, 21, 27, 33, 51])
molecules = Channel.from(['protein', 'dna'])
log2_sketch_sizes = Channel.from([10, 12, 14, 16])

parameters = molecules
    .combine(ksizes)
    .combine(log2_sketch_sizes)
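As far as I can tell, that gives the full Cartesian product as flat tuples -- 2 * 5 * 4 = 40 combinations of [molecule, ksize, log2_sketch_size]. A variant of the snippet above that also prints them for sanity-checking would be something like this (the parameters_debug name is just for illustration, and the channel is duplicated because a channel can only be consumed once):

molecules
    .combine(ksizes)
    .combine(log2_sketch_sizes)
    .into { parameters; parameters_debug }

// debugging only: prints e.g. [protein, 15, 10], [protein, 15, 12], ...
parameters_debug.subscribe { println it }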


Now I'd like to run the following for each set of parameters. I tried using `each` as shown in the FAQ (https://www.nextflow.io/docs/latest/faq.html#how-do-i-iterate-over-a-process-n-times), but that example seemed specific to integers.


params.samples = "samples.csv"
params.outdir = "s3://olgabot-maca/nf-kmer-similarity/human_mouse_zebrafish/"

sketch_id = "molecule-${params.molecule}_ksize-${params.ksize}_log2sketchsize-${params.log2_sketch_size}"

if (params.molecule == "protein") {
    other_molecule = "dna"
} else {
    other_molecule = "protein"
}


Channel
    .fromPath(params.samples)
    .splitCsv(header: true)
    .map { row -> tuple(row.sample_id, file(row.read1), file(row.read2)) }
    .set { samples_ch }


//AWSBatch sanity checking
if(workflow.profile == 'awsbatch'){
    if (!params.awsqueue || !params.awsregion) exit 1, "Specify correct --awsqueue and --awsregion parameters on AWSBatch!"
    if (!workflow.workDir.startsWith('s3') || !params.outdir.startsWith('s3')) exit 1, "Specify S3 URLs for workDir and outdir parameters on AWSBatch!"
}




process sourmash_compute_sketch {
    tag "${sketch_id}"
    publishDir "${params.outdir}/sketches/${sketch_id}", mode: 'copy'
    container 'czbiohub/kmer-hashing'

    // If job fails, try again with more memory
    memory { 2.GB * task.attempt }
    errorStrategy 'retry'

    input:
    each molecule, ksize, log2_sketch_size from parameters
    set sample_id, file(read1), file(read2) from samples_ch

    output:
    file "${sample_id}.sig" into sourmash_sketches

    script:
    """
    sourmash compute \
        --num-hashes \$((2**$log2_sketch_size)) \
        --ksizes $ksize \
        --$molecule \
        --output ${sample_id}.sig \
        --merge '$sample_id' $read1 $read2
    """
}


process sourmash_compare_sketches {
    tag "${sketch_id}"

    container 'czbiohub/kmer-hashing'
    publishDir "${params.outdir}/", mode: 'copy'
    memory { 1024.GB * task.attempt }
    // memory { sourmash_sketches.size() < 100 ? 8.GB :
    //          sourmash_sketches.size() * 100.MB * task.attempt }
    errorStrategy 'retry'

    input:
    each molecule, ksize, log2_sketch_size from parameters
    file ("sketches/${sketch_id}/*") from sourmash_sketches.collect()

    output:
    file "similarities_${sketch_id}.csv"

    script:
    """
    sourmash compare \
        --ksize $ksize \
        --$molecule \
        --csv similarities_${sketch_id}.csv \
        --traverse-directory .
    """
}


I'm confused about how to make the input files a value channel so they can be used multiple times, how to allow multiple sourmash_compute_sketch tasks to run, and how to have sourmash_compare_sketches start as soon as all of the sourmash_compute_sketch tasks for a particular parameter set have completed.

PS: I'm trying to make the memory requirement for sourmash_compare_sketches a function of both the number of files and the total file size, e.g.
memory = 10*n_files*file_size_sum
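In Nextflow terms, I imagine something roughly like this (just a sketch of what I mean, not tested -- it assumes the collected sketches are bound to a named input such as `file(sketches) from sourmash_sketches.collect()`, and the constants are placeholders):

memory {
    def n_files = sketches.size()                            // how many sketch files were collected
    def total_bytes = sketches.collect { it.size() }.sum()   // their combined size in bytes
    // placeholder formula; the factor of 10 is a guess, scaled up on each retry
    new nextflow.util.MemoryUnit( (long) (10 * n_files * total_bytes * task.attempt) )
}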

How can one specify this?

Can you help?
Thank you!
Warmest,
Olga

Paolo Di Tommaso

Mar 18, 2019, 3:13:58 AM
to nextflow
The proper way to handle this is with `each`, as you have already pointed out. You need a separate `each` statement for each value range, for example:

input: 
each ksizes from([15, 21, 27, 33, 51])
each molecule from(['protein', 'dna'])
each log2_sketch_sizes from([10, 12, 14, 16])
file "sketches/${sketch_id}/*" from sourmash_sketches.collect()


Hope it helps. 

p



Olga Botvinnik

Mar 18, 2019, 1:05:45 PM
to next...@googlegroups.com
Great, thank you! Will the dependencies between processes allow one parameter set to finish in sourmash compute and then move on to sourmash compare? E.g. if ksize 21, molecule protein, and sketch size 10 finish first in sourmash compute, will sourmash compare start for those parameters? Or will it wait for ALL of sourmash compute to run before doing the compare?

Paolo Di Tommaso

Mar 19, 2019, 3:48:46 AM
to nextflow
Yes, each task execution is inherently parallel. Therefore downstream tasks start as soon as there's an available result. 
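For example, in a toy pipeline like this (a sketch, not your code), each bar task starts as soon as the corresponding foo task has emitted its result, without waiting for the other foo tasks:

nums = Channel.from(1, 2, 3)

process foo {
    input:
    val x from nums

    output:
    val y into foo_out

    exec:
    y = x * 10
}

process bar {
    input:
    val y from foo_out

    exec:
    println "bar received $y"
}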


p