I have 125x125 bp paired-end reads of which most should align to the human genome. The source material is also host to a retroviral infection, so I am very interested in first aligning to hg19 and then using the unaligned "leftover" material for subsequent viral alignments. This human first, virus later approach ensures I don't reuse reads, drastically speeds things up downstream, and is useful for identifying integration sites (read hg19, mate viral).
When I align with nvBowtie and default settings...
nvBowtie --device 0 -x hg19-nvidia -1 sequences/AP_ACTGAT_L005_R1_001.fastq.gz -2 sequences/AP_ACTGAT_L005_R2_001.fastq.gz -S AP-hg19-nvBowtie.bam
...I have BAM output which should have SAM flags for every read to tell me the order and orientation of hg19 (or lack of) alignment.
When I look at these reads, there's a ton of SAM flag 4...
(note: nvBowtie maintains the full original sequence read name which contains a space; this space means SAM flag is now $3)
samtools view AP-hg19-nvBowtie.bam | awk '{a[$3]++} END {for (b in a) {print b "\t" a[b]}}'
4 2466371
65 159821
81 841704
83 7535086
85 1
87 13
97 803652
99 7639630
101 9
105 285777
109 7
113 27406
121 82472
129 159821
137 224604
141 7
145 803660
147 7639619
149 1
151 11
153 55929
157 1
161 841695
163 7535099
165 10
177 27406
If I look at examples of SAM flag 4 reads, I quickly discover a few things.
1) Every line present in the BAM with SAM flag 4 has 1:N:0:ACTGAT for $2.
2) No matter what, R2 is never present in the final BAM file; nvBowtie replaces it with a copy of R1 in reverse-compliment.
Here is an example of this problem:
HISEQ:145:C81UYANXX:5:1101:6933:1991 1:N:0:ACTGAT 4 * 0 0 * * 0 0 GATCGGAAGAGCACACGTCTGAACTCCAGTCACACTGATATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAGTGTAATACACAGTCATCCGTTCACATAAGACGGAAGTAAGGTCGACAT #####################################################FB/B/FFBBFF<FBFFFBB<7BFFFFFFFFFF/FF<FBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB
HISEQ:145:C81UYANXX:5:1101:6933:1991 1:N:0:ACTGAT 4 * 0 0 * * 0 0 ATGTCGACCTTACTTCCGTCTTATGTGAACGGATGACTGTGTATTACACTTTTTTTTTTTTTTCAAGCAGAAGACGGCATACGAGATATCAGTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBF<FF/FFFFFFFFFFB7<BBFFFBF<FFBBFF/B/BF#####################################################
Sequence "HISEQ:145:C81UYANXX:5:1101:6933:1991" should have the following sequences reported:
zcat AP_ACTGAT_L005_R1_001.fastq.gz | grep -A 3 HISEQ:145:C81UYANXX:5:1101:6933:1991
@HISEQ:145:C81UYANXX:5:1101:6933:1991 1:N:0:ACTGAT
GATCGGAAGAGCACACGTCTGAACTCCAGTCACACTGATATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAGTGTAATACACAGTCATCCGTTCACATAAGACGGAAGTAAGGTCGACAT
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBF<FF/FFFFFFFFFFB7<BBFFFBF<FFBBFF/B/BF#####################################################
zcat AP_ACTGAT_L005_R2_001.fastq.gz | grep -A 3 HISEQ:145:C81UYANXX:5:1101:6933:1991
@HISEQ:145:C81UYANXX:5:1101:6933:1991 2:N:0:ACTGAT
GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAACACACTAAGGCGAACCTCACACCAGTCTTGATGCTCACTACTCAGCACGTCTGTGTGT
+
BBBBBFFFFFFFFFFFBFBFFFF<FFFFFFFFBFFFFFBFFFFFF<B7FF/FF<FFB/<FFFBFFFFF##########################################################
You'll note that the second sequence is completely missing in the BAM output (I searched) and is instead replaced with the reverse compliment of R1. There is no reason for this activity...
If I align instead with bowtie2...
bowtie2 --threads 15 -x /media/evo/genomes/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome -1 sequences/AP_ACTGAT_L005_R1_001.fastq.gz -2 sequences/AP_ACTGAT_L005_R2_001.fastq.gz -S AP-hg19-bowtie2.sam
...I get the following SAM flag output:
(note: SAM headers were skipped with the !~ condition below)
cat AP-hg19-bowtie2.sam | awk '$1 !~ /^@/ {a[$2]++} END {for (b in a) {print b "\t" a[b]}}'
65 64153
69 122761
73 108047
77 905228
81 1193470
83 7405195
89 109696
97 1192505
99 7400675
113 63176
129 64153
133 217743
137 60889
141 905228
145 1192505
147 7400675
153 61872
161 1193470
163 7405195
177 63176
There's still some "weirdness" (flags 73 and 89 both mate with 133) but everything is tagged by bowtie2 in a way that I can easily grab pairs where both are unmapped (77 / 141), or where one end maps while the other doesn't (73|89 / 133).
Most importantly, bowtie2 doesn't substitute fake the 2nd read for failed mappings.
Remember "HISEQ:145:C81UYANXX:5:1101:6933:1991"? Both R1 and R2 are properly represented in the bowtie2 output.
cat AP-hg19-bowtie2.sam | grep HISEQ:145:C81UYANXX:5:1101:6933:1991
HISEQ:145:C81UYANXX:5:1101:6933:1991 77 * 0 0 * * 0 0 GATCGGAAGAGCACACGTCTGAACTCCAGTCACACTGATATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAGTGTAATACACAGTCATCCGTTCACATAAGACGGAAGTAAGGTCGACAT BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBF<FF/FFFFFFFFFFB7<BBFFFBF<FFBBFF/B/BF##################################################### YT:Z:UP
HISEQ:145:C81UYANXX:5:1101:6933:1991 141 * 0 0 * * 0 0 GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAACACACTAAGGCGAACCTCACACCAGTCTTGATGCTCACTACTCAGCACGTCTGTGTGT BBBBBFFFFFFFFFFFBFBFFFF<FFFFFFFFBFFFFFBFFFFFF<B7FF/FF<FFB/<FFFBFFFFF########################################################## YT:Z:UP
nvBowtie is great for getting the stuff that *does* align and the output is nearly identical to bowtie2 at 3x the speed. This sloppy handling of unmapped reads, however, really puts a damper on trusting any output that isn't flagged 83 / 163 | 99 / 147.
Let me know what I can contribute to help fix this problem.