complicated splitting and collecting

Eric Davis

Oct 13, 2017, 12:28:35 PM

to Nextflow

Hi Paolo,

I was hoping you could give me some insight as to how to handle a rather complicated (IMO) splitting strategy considering:

- I have samples from tumor/normal pairs.

i.e. (Sample_N, Sample_T)

- Along the way, the samples are also split by chromosome.

i.e. (Sample_N_1, Sample_N_2, Sample_T_1, ...)

I don't need the T/N pairs to be aware of each other until downstream, so I don't use `fromFilePairs`...

process Index {
    
    input:
        file(sample) from samples

    output:
        set sample, file("${sample}.bai") into bam_indices

    """
    samtools index ${sample}
    """
}

process Mpileup {
    
    tag { tumor }

    input:
    set file(bam), file(bai) from bam_indices
    
    each chrom from((1..22), X, Y)
    
    output:
    file "${bam.baseName}.pileup.gz" into sample_pileups
    
    """
    samtools mpileup -r ${chrom} -f ${genome} -Q 20 ${bam} | gzip > ${bam.baseName}_${chrom}.pileup.gz
    """
}

Things get complicated later on, when I need to bring the matching tumor and normal for the same chromosome together in a single command. I'm trying something like:

process Build {

    tag { [tumor, normal] }
    
    input:
    no idea

    output:
    file("${sample.baseName}.gz") into out_files

    """
    some_command -n sample_N_1.bam -t sample_T_1.bam > output_1.bam
    """
}

Using collect or groupTuple gets me 80% of the way there, but I just can't seem to pull it off. Eventually the outputs from this step will just be concatenated into the final product. Any advice would be greatly appreciated.

Thanks!

Paolo Di Tommaso

Oct 13, 2017, 3:41:18 PM

to nextflow

Hi,

First, note that the `each` syntax in the Mpileup process is wrong. I guess you want to do:

each chrom from ((1..22) + ['X', 'Y'])
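One Groovy detail is easy to trip over when building the chromosome list passed to `each`: wrapping a range in square brackets produces a list whose single element is the range, while parentheses keep it a range that `+` can extend. A small plain-Groovy check (the variable names are only for illustration):

```groovy
// Plain Groovy, no Nextflow required.
def nested = [1..3] + ['X', 'Y']   // list literal wrapping a range, then two strings
def flat   = (1..3) + ['X', 'Y']   // range concatenated with a list

// The bracketed form keeps the range as one element: three elements in total.
assert nested.size() == 3
assert nested[0] == [1, 2, 3]

// The parenthesised form yields the flat list of chromosome labels.
assert flat == [1, 2, 3, 'X', 'Y']

// The full human set: 22 autosomes plus X and Y.
assert ((1..22) + ['X', 'Y']).size() == 24
```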

Then the common idiom in NF for grouping multiple files is to associate a key with them, in this case the chromosome. Therefore the output in the Mpileup process should be

output:
set val(chrom), file("${bam.baseName}_${chrom}.pileup.gz") into sample_pileups

By doing that, the channel `sample_pileups` emits pairs composed of the chromosome and the mpileup output (the declared file name has to match what the script actually writes, hence the `_${chrom}` suffix). Then you can use `groupTuple` to get all the files having the same chromosome, e.g.

process Build {

    input:
    set val(chrom), file(samples) from sample_pileups.groupTuple()

    """
    some_command
    """
}
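To see what `groupTuple` hands to the process, it can help to run the operator on stand-in values first. The following is only a sketch: the file names are made-up strings rather than real pileups, and the key is the first element of each pair, as in the snippet above.

```groovy
// Stand-in for sample_pileups: (chrom, pileup name) pairs with invented names.
Channel
    .from(
        ['1', 'sample1.T_1.pileup.gz'],
        ['1', 'sample1.N_1.pileup.gz'],
        ['X', 'sample1.T_X.pileup.gz'],
        ['X', 'sample1.N_X.pileup.gz']
    )
    .groupTuple()   // groups on the first element (index 0) by default
    .view()         // prints one item per chromosome: (chrom, list of names)
```

Each emitted item pairs a chromosome with all the names that share it. The order inside each group should not be relied on, so this alone does not say which entry is the tumor and which is the normal.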

Hope it helps

Cheers,
Paolo

Eric Davis

Oct 13, 2017, 4:26:30 PM

to Nextflow

Hi Paolo,

Thanks so much.

I had tried to go that route with chrom as an index (I had the "each" command right in the real code :)) but I keep getting:

ERROR~ No such variable: chrom

I have it exactly as you have it, and I have also tried using the "val" qualifier, without success.

Even so, while this approach should group the samples and chromosomes accordingly, I still have the problem of dispatching the Tumor and Normal samples to the correct locations in the subsequent command.

[1,[sample1.T.bam, sample1.N.bam, sample2.T.bam, sample2.N.bam]]
...
command --tumor sample1.T.bam --normal sample1.N.bam

Sorry if this is trivial. I think getting my head around some of these Groovy idioms may be one of the harder things I've done on a computer.

Thanks again for your help. I really like Nextflow.

Eric

Paolo Di Tommaso

Oct 16, 2017, 9:37:43 AM

to nextflow

Can you post your code somewhere (maybe pastebin.com) with the exact error message?

p

Eric Davis

Oct 18, 2017, 6:55:10 PM

to Nextflow

I think I got it sorted out. Thank you!


bruce moran

Oct 19, 2017, 7:26:12 AM

to Nextflow

Hi Eric,

Can you post your method please? I'm interested as a learning exercise.

Thanks,

Bruce.

Paolo Di Tommaso

Oct 19, 2017, 7:40:33 AM

to nextflow

I think the solution mentioned is the one implemented in this example pipeline in the lines highlighted
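For the tumor/normal dispatch problem raised earlier in the thread, one possible approach is to carry the patient and sample type as keys alongside the chromosome, split the channel by type, and join the two halves on the (patient, chrom) pair. This is a sketch under assumptions, not necessarily what Eric or the example pipeline does: the records and the names `tumors`, `normals` and `paired` are invented for illustration, and it assumes a Nextflow version that provides the `join` operator.

```groovy
// Stand-in for what Mpileup could emit if its output were declared as
// (patient, chrom, type, pileup). All values are hypothetical strings.
def records = [
    ['sample1', '1', 'T', 'sample1.T_1.pileup.gz'],
    ['sample1', '1', 'N', 'sample1.N_1.pileup.gz'],
    ['sample2', '1', 'T', 'sample2.T_1.pileup.gz'],
    ['sample2', '1', 'N', 'sample2.N_1.pileup.gz'],
    ['sample1', 'X', 'T', 'sample1.T_X.pileup.gz'],
    ['sample1', 'X', 'N', 'sample1.N_X.pileup.gz']
]

// Keep only tumors and drop the type field; (patient, chrom) stays as the key.
tumors = Channel.from(records)
    .filter { it[2] == 'T' }
    .map { patient, chrom, type, pileup -> [patient, chrom, pileup] }

// Same for normals.
normals = Channel.from(records)
    .filter { it[2] == 'N' }
    .map { patient, chrom, type, pileup -> [patient, chrom, pileup] }

// Join on the composite key formed by indices 0 and 1.
// Each result has the shape (patient, chrom, tumor pileup, normal pileup).
paired = tumors.join(normals, by: [0, 1])

// Show the command line each joined item would lead to.
paired
    .map { patient, chrom, tumor, normal ->
        "${patient} chr${chrom}: command --tumor ${tumor} --normal ${normal}"
    }
    .view()
```

Because every joined item already separates the tumor and normal entries, a downstream process in the thread's syntax could declare its input as `set val(patient), val(chrom), file(tumor), file(normal) from paired` and refer to `${tumor}` and `${normal}` directly in its command, with no need to pick them out of a grouped list.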