New collect and combine operators

Paolo Di Tommaso

Mar 6, 2017, 8:46:08 AM

to Nextflow

Dear all,

I've just uploaded a new NF snapshot (0.24.0-SNAPSHOT) which includes two new operators:

* the `collect` operator improves on the former `toList` operator and should be used in its place. Read more here.

* the `combine` operator replaces the former `spread` operator. It improves how nested structures are handled and allows combining items that share a common key. Read more here.

This version also includes a new feature that allows script parameters to be specified with a JSON/YAML file.

You can download it by using the following command:

NXF_VER=0.24.0-SNAPSHOT CAPSULE_RESET=1 nextflow info

Then you can use it as shown below:

NXF_VER=0.24.0-SNAPSHOT nextflow run .. etc

Your feedback is welcome.

Cheers,
Paolo

Mauricio Barrientos

Oct 4, 2017, 9:30:01 AM

to Nextflow

Dear Paolo,

I have a question on the behavior of combine. Combine appears to automatically flatten tuples. Is this intentional?

For example:

num = Channel.from(1,2,3)
tuple_ch = Channel.from( ['a','b'] , ['c','d'] )
num.combine(tuple_ch).println()

will output

[1,'a','b']
[2,'a','b']
[3,'a','b']
[1,'c','d']
[2,'c','d']
[3,'c','d']

Instead of

[1, ['a','b']]
[2, ['a','b']]
[3, ['a','b']]
[1, ['c','d']]
[2, ['c','d']]
[3, ['c','d']]

The behavior is a bit surprising, since I could not find any reference to this automatic "flattening" in the documentation. (Please let me know if it is mentioned somewhere!)

I also think the behavior is a bit undesirable when you want to combine multiple output files from two processes (e.g. BWA index output files and paired-end reads).

For example, I would expect to be able to write a "map" process like this:

samples = Channel.from( ['s1_R1.fq', 's1_R2.fq'], ['s2_R1.fq', 's2_R2.fq'] )
ref_idx = Channel.from( ['bacteria1.aln', 'bacteria1.bwt'], ['bacteria2.aln', 'bacteria2.bwt'] )
mapping_pairs = samples.combine(ref_idx)
process map {
    input:
    set file(sample), file(idx) from mapping_pairs
    ...
}

So that sample would refer to ['s1_R1.fq', 's1_R2.fq'] and idx would refer to ['bacteria1.aln', 'bacteria1.bwt'].

But with the current behavior, sample is assigned s1_R1.fq, idx is assigned s1_R2.fq, and the index files are ignored and not staged.

I did find a relatively simple workaround, which is to embed each tuple in another tuple with map. Defining my channel mapping_pairs as

mapping_pairs = samples.map { [it] }.combine( ref_idx.map { [it] } )

works, but it caught me a bit off guard.
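
For anyone who wants to try this wrapping workaround in isolation, here is a minimal sketch in the same style as the snippets above. The file names are the placeholder strings from this post, the intermediate names wrapped_samples and wrapped_idx are made up for the example, and it assumes the 0.24-era syntax used in this thread.

```groovy
// Placeholder file names; plain strings are enough to see the item shape.
samples = Channel.from( ['s1_R1.fq', 's1_R2.fq'], ['s2_R1.fq', 's2_R2.fq'] )
ref_idx = Channel.from( ['bacteria1.aln', 'bacteria1.bwt'], ['bacteria2.aln', 'bacteria2.bwt'] )

// Wrap each pair in a one-element list so that combine unpacks only the wrapper.
wrapped_samples = samples.map { [it] }
wrapped_idx     = ref_idx.map { [it] }

// Each emitted item should then be a pair of pairs: [ [reads], [index files] ].
wrapped_samples
    .combine(wrapped_idx)
    .println()
```

With items of this shape, a two-element input declaration such as set file(sample), file(idx) should receive the read pair as sample and the index files as idx.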

It would be nice to know your thoughts about this! I apologize if my text is a bit confusing; I am very new to Nextflow. I have also attached a simple example.nf that illustrates what I mean. Please let me know about any confusing bits, or if I can help out in any way.

Best,

Mauricio

example.nf

Paolo Di Tommaso

Oct 4, 2017, 5:35:38 PM

to nextflow

Hi,

Yes, this is expected. We found that this behaviour makes it easier to handle collections of files, e.g. read pairs, which is a very common pattern with NF.

Imagine you are handling read-pair samples. They would form a sequence like the following:

['s1', ['s1_R1.fq', 's1_R2.fq']]
['s2', ['s2_R1.fq', 's2_R2.fq']]
['s3', ['s3_R1.fq', 's3_R2.fq']]
..

and you need to combine it with another set of grouped files, e.g.:

['s1', ['bacteria1.aln', 'bacteria1.bwt']]
['s2', ['bacteria2.aln', 'bacteria2.bwt']]
..

Now, for the sake of this example, if we want to combine them we would write:

samples = Channel.from(
    ['s1', ['s1_R1.fq', 's1_R2.fq']],
    ['s2', ['s2_R1.fq', 's2_R2.fq']],
    ['s3', ['s3_R1.fq', 's3_R2.fq']] )
ref_idx = Channel.from(
    ['s1', ['bacteria1.aln', 'bacteria1.bwt']],
    ['s2', ['bacteria2.aln', 'bacteria2.bwt']] )
mapping_pairs = samples.combine(ref_idx)
mapping_pairs.println()

and it would print:

[s1, [s1_R1.fq, s1_R2.fq], s1, [bacteria1.aln, bacteria1.bwt]]
[s2, [s2_R1.fq, s2_R2.fq], s1, [bacteria1.aln, bacteria1.bwt]]
[s3, [s3_R1.fq, s3_R2.fq], s1, [bacteria1.aln, bacteria1.bwt]]
[s1, [s1_R1.fq, s1_R2.fq], s2, [bacteria2.aln, bacteria2.bwt]]
[s2, [s2_R1.fq, s2_R2.fq], s2, [bacteria2.aln, bacteria2.bwt]]
[s3, [s3_R1.fq, s3_R2.fq], s2, [bacteria2.aln, bacteria2.bwt]]

More likely we would want to combine by the sample_id, e.g.:

mapping_pairs = samples.combine(ref_idx, by:0)
mapping_pairs.println()

which prints:

[s1, [s1_R1.fq, s1_R2.fq], [bacteria1.aln, bacteria1.bwt]]
[s2, [s2_R1.fq, s2_R2.fq], [bacteria2.aln, bacteria2.bwt]]

In this way the result is a triple in which the first element is the sample_id and the second and third elements are the two pairs of files, which makes it easier for a process to handle.
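
As a rough illustration of how such a triple can be consumed downstream, the sketch below unpacks each combined item with a map closure. The closure parameter names (sample_id, reads, index) and the message format are made up for this example; the channels are the ones defined above.

```groovy
samples = Channel.from(
    ['s1', ['s1_R1.fq', 's1_R2.fq']],
    ['s2', ['s2_R1.fq', 's2_R2.fq']],
    ['s3', ['s3_R1.fq', 's3_R2.fq']] )

ref_idx = Channel.from(
    ['s1', ['bacteria1.aln', 'bacteria1.bwt']],
    ['s2', ['bacteria2.aln', 'bacteria2.bwt']] )

samples
    .combine(ref_idx, by: 0)
    // Each item is [sample_id, reads, index]; the closure parameters unpack it.
    .map { sample_id, reads, index ->
        "${sample_id}: ${reads.size()} read files, ${index.size()} index files"
    }
    .println()
```

Because the two file pairs stay nested, the same three-element shape could be declared as the input of a process instead of being unpacked in a closure.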

Regarding your example, if you are combining two channels of pairs, you could simply write a process that takes a quadruple as input, e.g.:

process foo {
    input:
    set file(read1), file(read2), file(aln), file(bwt) from mapping_pairs

    """
    your_command_here
    """
}

Hope it helps.

Cheers,
Paolo

--
You received this message because you are subscribed to the Google Groups "Nextflow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nextflow+unsubscribe@googlegroups.com.
Visit this group at https://groups.google.com/group/nextflow.
For more options, visit https://groups.google.com/d/optout.

Mauricio Barrientos

Oct 5, 2017, 5:40:08 AM

to Nextflow

Hi,

Thanks a lot for the reply, it's way clearer now. Keep up the great work!

Cheers,

Mauricio
