GroupBy - merging FastQ files across lanes?

Marc Hoeppner

May 5, 2017, 3:33:19 AM
to Nextflow
Hi,

I am struggling to merge FastQ files for libraries that were split across multiple lanes. I would like to:

- group all fastq files by library
- concatenate the forward and reverse into one file each (something like : zcat $fw_reads | gzip > $merged_fw)
- and then continue my "normal" processing of PE reads

Read groups and the like do not need to be preserved, as I am not using this for e.g. GATK variant calling.

I am guessing that the "groupBy" operator plays a role, but cannot get it to work. Maybe someone can put me on the right track? As an example of input files:

F15657-L1_S11_L001_R1_001.fastq.gz  F15657-L1_S11_L004_R1_001.fastq.gz  F15658-L1_S12_L003_R1_001.fastq.gz  F15659-L1_S13_L002_R1_001.fastq.gz
F15657-L1_S11_L001_R2_001.fastq.gz  F15657-L1_S11_L004_R2_001.fastq.gz  F15658-L1_S12_L003_R2_001.fastq.gz  F15659-L1_S13_L002_R2_001.fastq.gz
F15657-L1_S11_L002_R1_001.fastq.gz  F15658-L1_S12_L001_R1_001.fastq.gz  F15658-L1_S12_L004_R1_001.fastq.gz  F15659-L1_S13_L003_R1_001.fastq.gz
F15657-L1_S11_L002_R2_001.fastq.gz  F15658-L1_S12_L001_R2_001.fastq.gz  F15658-L1_S12_L004_R2_001.fastq.gz  F15659-L1_S13_L003_R2_001.fastq.gz
F15657-L1_S11_L003_R1_001.fastq.gz  F15658-L1_S12_L002_R1_001.fastq.gz  F15659-L1_S13_L001_R1_001.fastq.gz  F15659-L1_S13_L004_R1_001.fastq.gz
F15657-L1_S11_L003_R2_001.fastq.gz  F15658-L1_S12_L002_R2_001.fastq.gz  F15659-L1_S13_L001_R2_001.fastq.gz  F15659-L1_S13_L004_R2_001.fastq.gz

This should produce something like the following, grouped into a channel as PE reads with the library ID as the key:

F15657-L1_S11_R1.fastq.gz
F15657-L1_S11_R2.fastq.gz
F15658-L1_S12_R1.fastq.gz
F15658-L1_S12_R2.fastq.gz
F15659-L1_S13_R1.fastq.gz
F15659-L1_S13_R2.fastq.gz


/M

Paolo Di Tommaso

May 5, 2017, 3:53:44 AM
to nextflow
To group all fastq files by library you need to fetch the library id from the file name and then group them together by using groupTuple. For example: 

Channel
     .fromPath('F15657-L1_*.fastq.gz')
     .map { file -> tuple(getLibraryId(file), file) }
     .groupTuple() 
     .set { fastq_ch }

where getLibraryId is a custom function you need to implement to extract the library id from the file name, e.g.:

def getLibraryId( file ) {
  file.name.substring(0,8) 
}

Or maybe by using a regex or a different strategy, depending on how complex this rule is.
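
As a rough illustration of the regex route, here is a minimal sketch. It assumes the names follow the `<library>_L<3-digit lane>_R<1|2>_001.fastq.gz` layout from the listing earlier in the thread; the pattern and the fallback behaviour are assumptions, not something the thread specifies. Unlike a fixed-length substring, it does not depend on every library name having the same length.

```groovy
// Hypothetical regex-based variant of getLibraryId.
// Assumption: names look like <library>_L<3-digit lane>[_R1_001.fastq.gz].
def getLibraryId( file ) {
    // Accept either a file object (has a .name property) or a plain String
    def name = (file instanceof CharSequence) ? file.toString() : file.name
    // Lazily capture everything before the lane token, e.g. "_L001"
    def m = name =~ /(.+?)_L\d{3}(?:_.*)?/
    // matches() requires the whole name to fit the pattern;
    // names that do not fit are returned unchanged
    m.matches() ? m.group(1) : name
}
```

With the sample names this would use `F15657-L1_S11` as the grouping key, so all lanes of one library end up in the same tuple.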

Then, to concatenate the forward and reverse reads into one file each, you will need a `process`, which can use your bash snippet. I'm wondering whether this process can receive all the fastq files with the same library id at once, or whether you want to process them pair by pair. In the latter case a different approach from the one shown above may be required.


Cheers, p 


Marc Hoeppner

May 5, 2017, 7:55:32 AM
to Nextflow
Thanks, the grouping works. However, I am getting something like this out of the channel, as expected:

F15657-L1 : [ F15657-L1_S11_L001_R1_001.fastq.gz, F15657-L1_S11_L001_R2_001.fastq.gz, F15657-L1_S11_L002_R1_001.fastq.gz, F15657-L1_S11_L002_R2_001.fastq.gz, F15657-L1_S11_L003_R2_001.fastq.gz,  F15657-L1_S11_L004_R1_001.fastq.gz,  F15657-L1_S11_L004_R2_001.fastq.gz  ]

I thought I could do something like:

input:
set id,read_files from fastq_ch

and then:
forward_reads = read_files.findAll { file -> file.contains("_R1_") }
reverse_reads = read_files.findAll { file -> file.contains("_R2_") }

Something like that anyway. It doesn't seem to work though. Is there a way to do this within one process? I am guessing something like what I am trying here *should* work, but I must be missing something about Groovy and/or Nextflow idioms.
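
For reference, a minimal plain-Groovy sketch of the split attempted above. One likely stumbling block is that the grouped elements are file objects rather than strings, so the substring check has to go through the file name. The sample list and the use of `java.io.File` as a stand-in for the paths Nextflow provides are assumptions made for illustration.

```groovy
// Assumption: read_files holds file objects (not strings), as groupTuple emits them.
// java.io.File stands in here for the paths Nextflow would provide.
def read_files = [
    new File('F15657-L1_S11_L002_R2_001.fastq.gz'),
    new File('F15657-L1_S11_L001_R1_001.fastq.gz'),
    new File('F15657-L1_S11_L001_R2_001.fastq.gz'),
    new File('F15657-L1_S11_L002_R1_001.fastq.gz')
]

// Compare against the file name, then sort so both lists share the same lane order
def forward_reads = read_files.findAll { it.name.contains('_R1_') }.sort { it.name }
def reverse_reads = read_files.findAll { it.name.contains('_R2_') }.sort { it.name }
```

Sorting both lists by name keeps the forward and reverse files in the same lane order, which matters once the lanes are concatenated into paired files.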

Paolo Di Tommaso

May 5, 2017, 8:32:20 AM
to nextflow
In this case it may be better to refactor the source channel a bit so that it keeps the paired files separate. The following should work:

Channel
     .fromFilePairs('*.fastq.gz', flat: true)
     .map { prefix, file1, file2 -> tuple(getLibraryId(prefix), file1, file2) }
     .groupTuple() 
     .set { fastq_ch }


The trick here is that fromFilePairs (with flat: true) will emit triples in which the first element is the pair id, the second is the forward read file, and the third is the reverse read file.

The remaining logic is the same. 

Finally, you will need to declare the following input in the downstream process:

input:
set id, forward_reads, reverse_reads from fastq_ch 
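
Putting the pieces together, the merging process could look roughly like the sketch below. It uses the same `set ... from` input syntax as the declaration above; the process name `mergeLanes`, the output channel `merged_ch` and the output file names are hypothetical, and the `file(...)` qualifiers are an assumption about how the grouped files would be staged.

```groovy
// Sketch only: mergeLanes, merged_ch and the output file names are hypothetical.
process mergeLanes {
    tag "$id"

    input:
    // After groupTuple, forward_reads and reverse_reads are lists with one file per lane
    set id, file(forward_reads), file(reverse_reads) from fastq_ch

    output:
    // One merged pair per library, keyed by the library id
    set id, file("${id}_R1.fastq.gz"), file("${id}_R2.fastq.gz") into merged_ch

    script:
    // A list of files is rendered as space-separated names in the script
    """
    zcat ${forward_reads} | gzip > ${id}_R1.fastq.gz
    zcat ${reverse_reads} | gzip > ${id}_R2.fastq.gz
    """
}
```

Because groupTuple builds both lists from the same incoming triples, the forward and reverse files should appear in the same lane order, so the merged R1 and R2 files stay in sync.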



Does that make sense?


p


Marc Hoeppner

May 5, 2017, 9:02:03 AM
to Nextflow
That seems to work like a charm, thanks!

Paolo Di Tommaso

May 5, 2017, 9:03:31 AM
to nextflow
Cool!

p
