Improper SAM flags on unmapped paired end reads

103 views

Skip to first unread message

Kyle Nilson

unread,

Dec 10, 2015, 1:23:40 PM12/10/15

to nvbio-users

I have 125x125 bp paired-end reads of which most should align to the human genome. The source material is also host to a retroviral infection, so I am very interested in first aligning to hg19 and then using the unaligned "leftover" material for subsequent viral alignments. This human first, virus later approach ensures I don't reuse reads, drastically speeds things up downstream, and is useful for identifying integration sites (read hg19, mate viral).

When I align with nvBowtie and default settings...

nvBowtie --device 0 -x hg19-nvidia -1 sequences/AP_ACTGAT_L005_R1_001.fastq.gz -2 sequences/AP_ACTGAT_L005_R2_001.fastq.gz -S AP-hg19-nvBowtie.bam

...I have BAM output which should have SAM flags for every read to tell me the order and orientation of hg19 (or lack of) alignment.

When I look at these reads, there's a ton of SAM flag 4...

(note: nvBowtie maintains the full original sequence read name which contains a space; this space means SAM flag is now $3)

samtools view AP-hg19-nvBowtie.bam | awk '{a[$3]++} END {for (b in a) {print b "\t" a[b]}}'

4	2466371
65	159821
81	841704
83	7535086
85	1
87	13
97	803652
99	7639630
101	9
105	285777
109	7
113	27406
121	82472
129	159821
137	224604
141	7
145	803660
147	7639619
149	1
151	11
153	55929
157	1
161	841695
163	7535099
165	10
177	27406

If I look at examples of SAM flag 4 reads, I quickly discover a few things.

1) Every line present in the BAM with SAM flag 4 has 1:N:0:ACTGAT for $2.

2) No matter what, R2 is never present in the final BAM file; nvBowtie replaces it with a copy of R1 in reverse-compliment.

Here is an example of this problem:

HISEQ:145:C81UYANXX:5:1101:6933:1991 1:N:0:ACTGAT	4	*	0	0	*	*	0	0	GATCGGAAGAGCACACGTCTGAACTCCAGTCACACTGATATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAGTGTAATACACAGTCATCCGTTCACATAAGACGGAAGTAAGGTCGACAT	#####################################################FB/B/FFBBFF<FBFFFBB<7BFFFFFFFFFF/FF<FBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB
HISEQ:145:C81UYANXX:5:1101:6933:1991 1:N:0:ACTGAT	4	*	0	0	*	*	0	0	ATGTCGACCTTACTTCCGTCTTATGTGAACGGATGACTGTGTATTACACTTTTTTTTTTTTTTCAAGCAGAAGACGGCATACGAGATATCAGTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC	BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBF<FF/FFFFFFFFFFB7<BBFFFBF<FFBBFF/B/BF#####################################################

Sequence "HISEQ:145:C81UYANXX:5:1101:6933:1991" should have the following sequences reported:

zcat AP_ACTGAT_L005_R1_001.fastq.gz | grep -A 3 HISEQ:145:C81UYANXX:5:1101:6933:1991

@HISEQ:145:C81UYANXX:5:1101:6933:1991 1:N:0:ACTGAT
GATCGGAAGAGCACACGTCTGAACTCCAGTCACACTGATATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAGTGTAATACACAGTCATCCGTTCACATAAGACGGAAGTAAGGTCGACAT
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBF<FF/FFFFFFFFFFB7<BBFFFBF<FFBBFF/B/BF#####################################################

zcat AP_ACTGAT_L005_R2_001.fastq.gz | grep -A 3 HISEQ:145:C81UYANXX:5:1101:6933:1991

@HISEQ:145:C81UYANXX:5:1101:6933:1991 2:N:0:ACTGAT
GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAACACACTAAGGCGAACCTCACACCAGTCTTGATGCTCACTACTCAGCACGTCTGTGTGT
+
BBBBBFFFFFFFFFFFBFBFFFF<FFFFFFFFBFFFFFBFFFFFF<B7FF/FF<FFB/<FFFBFFFFF##########################################################

You'll note that the second sequence is completely missing in the BAM output (I searched) and is instead replaced with the reverse compliment of R1. There is no reason for this activity...

If I align instead with bowtie2...

bowtie2 --threads 15 -x /media/evo/genomes/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome -1 sequences/AP_ACTGAT_L005_R1_001.fastq.gz -2 sequences/AP_ACTGAT_L005_R2_001.fastq.gz -S AP-hg19-bowtie2.sam

...I get the following SAM flag output:

(note: SAM headers were skipped with the !~ condition below)

cat AP-hg19-bowtie2.sam | awk '$1 !~ /^@/ {a[$2]++} END {for (b in a) {print b "\t" a[b]}}'

65 64153
69 122761
73 108047
77 905228
81 1193470
83 7405195
89 109696
97 1192505
99 7400675
113 63176
129 64153
133 217743
137 60889
141 905228
145 1192505
147 7400675
153 61872
161 1193470
163 7405195
177 63176

There's still some "weirdness" (flags 73 and 89 both mate with 133) but everything is tagged by bowtie2 in a way that I can easily grab pairs where both are unmapped (77 / 141), or where one end maps while the other doesn't (73|89 / 133).

Most importantly, bowtie2 doesn't substitute fake the 2nd read for failed mappings.

Remember "HISEQ:145:C81UYANXX:5:1101:6933:1991"? Both R1 and R2 are properly represented in the bowtie2 output.

cat AP-hg19-bowtie2.sam | grep HISEQ:145:C81UYANXX:5:1101:6933:1991

HISEQ:145:C81UYANXX:5:1101:6933:1991 77 * 0 0 * * 0 0 GATCGGAAGAGCACACGTCTGAACTCCAGTCACACTGATATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAGTGTAATACACAGTCATCCGTTCACATAAGACGGAAGTAAGGTCGACAT BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBF<FF/FFFFFFFFFFB7<BBFFFBF<FFBBFF/B/BF##################################################### YT:Z:UP
HISEQ:145:C81UYANXX:5:1101:6933:1991 141 * 0 0 * * 0 0 GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAACACACTAAGGCGAACCTCACACCAGTCTTGATGCTCACTACTCAGCACGTCTGTGTGT BBBBBFFFFFFFFFFFBFBFFFF<FFFFFFFFBFFFFFBFFFFFF<B7FF/FF<FFB/<FFFBFFFFF########################################################## YT:Z:UP

nvBowtie is great for getting the stuff that *does* align and the output is nearly identical to bowtie2 at 3x the speed. This sloppy handling of unmapped reads, however, really puts a damper on trusting any output that isn't flagged 83 / 163 | 99 / 147.

Let me know what I can contribute to help fix this problem.

Nuno Subtil

unread,

Dec 15, 2015, 1:17:04 AM12/15/15

to nvbio-users

Hi Kyle,

Looking at the output code, I think I can see where these problems might be coming from. Most of these issues should be relatively simple to fix.

Would you be able to share your data set so that I can debug this?

Thanks,

Nuno

Reply all

Reply to author

Forward

Message has been deleted

0 new messages