I am trying to collect all of the read 1 (R1) and read 2 (R2) fastq files for each sample so they can be merged for downstream analysis. My standard workflow starts with a samplesheet that looks like this:
HapMap-B17-1267,fastq/HapMap-B17-1267_S8_L001_R1_001.fastq.gz,fastq/HapMap-B17-1267_S8_L001_R2_001.fastq.gz
HapMap-B17-1267,fastq/HapMap-B17-1267_S8_L002_R1_001.fastq.gz,fastq/HapMap-B17-1267_S8_L002_R2_001.fastq.gz
HapMap-B17-1267,fastq/HapMap-B17-1267_S8_L003_R1_001.fastq.gz,fastq/HapMap-B17-1267_S8_L003_R2_001.fastq.gz
HapMap-B17-1267,fastq/HapMap-B17-1267_S8_L004_R1_001.fastq.gz,fastq/HapMap-B17-1267_S8_L004_R2_001.fastq.gz
NTC-H2O,fastq/NTC-H2O_S1_L001_R1_001.fastq.gz,fastq/NTC-H2O_S1_L001_R2_001.fastq.gz
NTC-H2O,fastq/NTC-H2O_S1_L002_R1_001.fastq.gz,fastq/NTC-H2O_S1_L002_R2_001.fastq.gz
NTC-H2O,fastq/NTC-H2O_S1_L003_R1_001.fastq.gz,fastq/NTC-H2O_S1_L003_R2_001.fastq.gz
NTC-H2O,fastq/NTC-H2O_S1_L004_R1_001.fastq.gz,fastq/NTC-H2O_S1_L004_R2_001.fastq.gz
SeraCare-1to1-Positive,fastq/SeraCare-1to1-Positive_S2_L001_R1_001.fastq.gz,fastq/SeraCare-1to1-Positive_S2_L001_R2_001.fastq.gz
SeraCare-1to1-Positive,fastq/SeraCare-1to1-Positive_S2_L002_R1_001.fastq.gz,fastq/SeraCare-1to1-Positive_S2_L002_R2_001.fastq.gz
SeraCare-1to1-Positive,fastq/SeraCare-1to1-Positive_S2_L003_R1_001.fastq.gz,fastq/SeraCare-1to1-Positive_S2_L003_R2_001.fastq.gz
SeraCare-1to1-Positive,fastq/SeraCare-1to1-Positive_S2_L004_R1_001.fastq.gz,fastq/SeraCare-1to1-Positive_S2_L004_R2_001.fastq.gz
I need to merge all of the R1 and R2 fastq.gz files per sample (the second and third columns shown; the first column is the sample ID) into a single R1 file and a single R2 file per sample.
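To make the goal concrete, here is a minimal shell sketch of the merge I am after for a single sample, outside of Nextflow. The toy files, the temp directory, and the `workdir` variable are made up for illustration; only the file naming pattern comes from the samplesheet above. It relies on the fact that gzip files can be concatenated directly with `cat`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical toy data: two lanes of one sample, written to a temp dir.
workdir="$(mktemp -d)"
sample_ID="NTC-H2O"
for read in R1 R2; do
    for lane in L001 L002; do
        printf '@%s_%s\nACGT\n+\nIIII\n' "$lane" "$read" \
            | gzip -c > "$workdir/${sample_ID}_S1_${lane}_${read}_001.fastq.gz"
    done
done

# Concatenate the lanes of each read into one file per sample and read.
# gzip members can be joined as-is; no need to decompress first.
for read in R1 R2; do
    cat "$workdir/${sample_ID}"_S1_L00*_"${read}"_001.fastq.gz \
        > "$workdir/${sample_ID}_${read}.fastq.gz"
done

# Count the lines in the merged R1 file (4 lines per fastq record).
gzip -dc "$workdir/${sample_ID}_R1.fastq.gz" | wc -l
```

The glob expands in sorted order, so the lanes are concatenated in the same order for R1 and R2, which keeps the read pairs aligned between the two merged files.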
I tried using this code:
params.fastq_raw_sheet = "samples.fastq-raw.test.csv"

Channel.fromPath( file(params.fastq_raw_sheet) )
    .splitCsv()
    .map { row ->
        def sample_ID = row[0]
        def read1 = file(row[1])
        def read2 = file(row[2])
        return [ sample_ID, read1, read2 ]
    }
    .groupTuple()
    .into { sample_fastq_r1r2; sample_fastq_r1r2_2 }

sample_fastq_r1r2_2.println()

process fastq_merge {
    executor "local"
    echo true

    input:
    set val(sample_ID), file("*") from sample_fastq_r1r2

    script:
    """
    echo "${sample_ID} - \$(pwd)"
    echo "*"
    # cat "*" > "${sample_ID}_R1.fastq.gz"
    """
}
But it does not give the desired output; it results in this:
./nextflow run wes.nf
N E X T F L O W ~ version 0.27.2
Launching `wes.nf` [chaotic_baekeland] - revision: 21e8474842
[HapMap-B17-1267, [fastq/HapMap-B17-1267_S8_L001_R1_001.fastq.gz, fastq/HapMap-B17-1267_S8_L002_R1_001.fastq.gz, fastq/HapMap-B17-1267_S8_L003_R1_001.fastq.gz, fastq/HapMap-B17-1267_S8_L004_R1_001.fastq.gz], [fastq/HapMap-B17-1267_S8_L001_R2_001.fastq.gz, fastq/HapMap-B17-1267_S8_L002_R2_001.fastq.gz, fastq/HapMap-B17-1267_S8_L003_R2_001.fastq.gz, fastq/HapMap-B17-1267_S8_L004_R2_001.fastq.gz]]
[NTC-H2O, [fastq/NTC-H2O_S1_L001_R1_001.fastq.gz, fastq/NTC-H2O_S1_L002_R1_001.fastq.gz, fastq/NTC-H2O_S1_L003_R1_001.fastq.gz, fastq/NTC-H2O_S1_L004_R1_001.fastq.gz], [fastq/NTC-H2O_S1_L001_R2_001.fastq.gz, fastq/NTC-H2O_S1_L002_R2_001.fastq.gz, fastq/NTC-H2O_S1_L003_R2_001.fastq.gz, fastq/NTC-H2O_S1_L004_R2_001.fastq.gz]]
[SeraCare-1to1-Positive, [fastq/SeraCare-1to1-Positive_S2_L001_R1_001.fastq.gz, fastq/SeraCare-1to1-Positive_S2_L002_R1_001.fastq.gz, fastq/SeraCare-1to1-Positive_S2_L003_R1_001.fastq.gz, fastq/SeraCare-1to1-Positive_S2_L004_R1_001.fastq.gz], [fastq/SeraCare-1to1-Positive_S2_L001_R2_001.fastq.gz, fastq/SeraCare-1to1-Positive_S2_L002_R2_001.fastq.gz, fastq/SeraCare-1to1-Positive_S2_L003_R2_001.fastq.gz, fastq/SeraCare-1to1-Positive_S2_L004_R2_001.fastq.gz]]
[warm up] executor > local
WARN: Input tuple does not match input set cardinality declared by process `fastq_merge` -- offending value: [HapMap-B17-1267, [fastq/HapMap-B17-1267_S8_L001_R1_001.fastq.gz, fastq/HapMap-B17-1267_S8_L002_R1_001.fastq.gz, fastq/HapMap-B17-1267_S8_L003_R1_001.fastq.gz, fastq/HapMap-B17-1267_S8_L004_R1_001.fastq.gz], [fastq/HapMap-B17-1267_S8_L001_R2_001.fastq.gz, fastq/HapMap-B17-1267_S8_L002_R2_001.fastq.gz, fastq/HapMap-B17-1267_S8_L003_R2_001.fastq.gz, fastq/HapMap-B17-1267_S8_L004_R2_001.fastq.gz]]
[8b/6ec3ac] Submitted process > fastq_merge (3)
[52/ba4e43] Submitted process > fastq_merge (1)
[03/d7f736] Submitted process > fastq_merge (2)
SeraCare-1to1-Positive - work/8b/6ec3ac9e6fc21c1d2ce20b4759fa9e
*
HapMap-B17-1267 - work/52/ba4e431819317209da5112c685b6a6
*
NTC-H2O - work/03/d7f736e7ef5fa9af35f2e1398a0a4f
*
Cleaned up a little, the grouped output for one sample looks like this:
[NTC-H2O,
    [fastq/NTC-H2O_S1_L001_R1_001.fastq.gz, fastq/NTC-H2O_S1_L002_R1_001.fastq.gz, fastq/NTC-H2O_S1_L003_R1_001.fastq.gz, fastq/NTC-H2O_S1_L004_R1_001.fastq.gz],
    [fastq/NTC-H2O_S1_L001_R2_001.fastq.gz, fastq/NTC-H2O_S1_L002_R2_001.fastq.gz, fastq/NTC-H2O_S1_L003_R2_001.fastq.gz, fastq/NTC-H2O_S1_L004_R2_001.fastq.gz]
]
Instead, I need something like a nested map, where each list of files is labeled as R1 or R2:
[NTC-H2O : [
    R1 : [fastq/NTC-H2O_S1_L001_R1_001.fastq.gz, fastq/NTC-H2O_S1_L002_R1_001.fastq.gz, fastq/NTC-H2O_S1_L003_R1_001.fastq.gz, fastq/NTC-H2O_S1_L004_R1_001.fastq.gz],
    R2 : [fastq/NTC-H2O_S1_L001_R2_001.fastq.gz, fastq/NTC-H2O_S1_L002_R2_001.fastq.gz, fastq/NTC-H2O_S1_L003_R2_001.fastq.gz, fastq/NTC-H2O_S1_L004_R2_001.fastq.gz]
]]
I also need my process to take an input that is something like this:
    input:
    set val(sample_ID), val(R1R2_ID), file("*")

    output:
    set val(sample_ID), val(R1R2_ID), file("${sample_ID}_${R1R2_ID}.fastq.gz") into sample_merged_fastqs
That is, for each sample I need to output the individual files "NTC-H2O_R1.fastq.gz" and "NTC-H2O_R2.fastq.gz", along with their associated metadata.
Any suggestions or ideas on how to implement this?