Hi Edoardo,
I recently had to solve exactly the same problem. The trick is to replace the first value, 'sample_id', in your 'fastqfiles' channel with 'groupKey(sample_id, number_of_parts)'.
This can be done in two steps:
(1) create a second channel 'fastqfiles_count' with values 'sample_id, groupKey(sample_id, number_of_parts)':
    // the channel source is not shown here; it should emit rows of (sample_id, fastq path)
    fastqfiles_count = Channel
        .map { row -> tuple(row[0], file(row[1])) }
        .unique()
        .groupTuple()
        .map { sample, list_F -> [sample, groupKey(sample, list_F.size())] }
(in your case the 'unique' step can probably be dropped)
(2) combine the 'fastqfiles_count' and 'fastqfiles' channels by the 'sample_id' key and drop that key to put the groupKey in front position:
    fastqfiles_new = fastqfiles_count
        .combine(fastqfiles, by: 0)
        .map { it -> [it.get(1), it.get(2), it.get(3)] }
Then continue as before, replacing fastqfiles by fastqfiles_new, but leaving the rest of the code as it is:
    BWA(fastqfiles_new, genome_data)
    MERGE_BAMS(BWA.out.bam_file.groupTuple())
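I have not tested this exact code, but because each key now carries its expected group size, groupTuple should emit a sample's group as soon as all of its parts have arrived. MERGE_BAMS can then start for each individual sample as soon as possible rather than waiting for the BWA step to finish for all samples.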
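To see the mechanism in isolation, here is a minimal toy sketch. It assumes DSL2, and the sample names and part labels are made up, with plain strings standing in for the fastq files. It is not tested against your pipeline. Note that the counting step still waits for the whole input channel to close, but that only involves listing the inputs, not running BWA.

```groovy
// Toy sketch: hypothetical sample ids and part names, no real files or processes.
workflow {
    // One item per part: [sample_id, part_name]
    parts = Channel.of(
        ['sampleA', 'A_part1'],
        ['sampleB', 'B_part1'],
        ['sampleA', 'A_part2'],
        ['sampleB', 'B_part2'],
        ['sampleB', 'B_part3']
    )

    // Count the parts per sample and wrap the count in a groupKey
    keys = parts
        .groupTuple()
        .map { sample, names -> [sample, groupKey(sample, names.size())] }

    // Swap the plain sample_id for its groupKey: [groupKey, part_name]
    keyed = keys
        .combine(parts, by: 0)
        .map { it -> [it[1], it[2]] }

    // groupTuple can now emit a sample once it has received as many
    // items as the size stored in that sample's groupKey
    keyed
        .groupTuple()
        .view { it -> "${it[0]}: ${it[1].size()} parts -> ${it[1]}" }
}
```

In the real pipeline, the 'keyed' channel plays the role of 'fastqfiles_new', and the final groupTuple is the one applied to BWA.out.bam_file.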
I would be happy to hear about a more elegant solution. This could also serve as a gentle reminder to the Nextflow team that a groupKey example in the documentation would be very welcome...
kind regards,
--luc