Hi,
I ran tophat2(v2.0.9) and bowtie2(v1.1.0) with default settings using a GTF for UCSC known genes downloaded from the UCSC genome browser. I noticed a fairly large discrepency between the align_summary.txt and samtools flagstat (see below). I have bolded the discrepancies between the alignment summaries.
> cat align_summary.txt
Mapped: 2988680 (74.7% of input)
of these: 218271 ( 7.3%) have multiple alignments (292 have >20)
Mapped: 2916894 (72.9% of input)
of these: 209473 ( 7.2%) have multiple alignments (12 have >20)
73.8% overall read alignment rate.
of these: 134377 ( 4.8%) have multiple alignments
and: 89469 ( 3.2%) are discordant alignments
67.9% concordant pair alignment rate.
> samtools view -u -F 0x100 accepted_hits.sorted.bam | samtools flagstat -
6114505 + 0 in total (QC-passed reads + QC-failed reads)
6114505 + 0 mapped (100.00%:nan%)
6114505 + 0 paired in sequencing
405380 + 0 properly paired (6.63%:nan%)
5610950 + 0 with itself and mate mapped
503555 + 0 singletons (8.24%:nan%)
135560 + 0 with mate mapped to a different chr
45998 + 0 with mate mapped to a different chr (mapQ>=5)
To investigate what was going on, I examined a single read with a high mapping quality in the accepted_hits.bam file. It showed that this read mapped uniquely to the genome, but the forward read was mapping to chr1 while the reverse read was mapping to chr6
FORWARD_READ 129 chr1 98853 50 100M chr6 170993888 0 GTGGGGAGGGAAGAAAAAAGGAGGGGGAGAGAAAGAGAAATAAGAACCAAGTTTATTATACTGTATTCAGGGGGAAAAAATTTTCCCAAGGTCCTAACAG B@@FDFDDHHFFFHIIJGIIIBGHIJI/BAAEEHF>@CEEEEEEDDDDDDB>ADDCDDEEDED>CDDDDDDD;BBBBADDBCCDDA@>?AB:@@>:C@C> AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU NH:i:1
REVERSE_READ 65 chr6 170993888 50 100M chr1 98853 0 ACACCTGTAATCCCAGCTACTTGGGCGGCTGAGGCAGGAGAATCACTTGAACCTGGGAGGCAGAAGTTGCAGTGAGTCAAGATTGCGCCAATGCACTCCA @C@FFEFFHFHHHJJJJIIJIIJJIIGGIIIJJJIIIGDGGIGHFHHHCEFFFCDEC?BDDDBBDC:AACCD3>>::@@@C@CDDC><B@BCCCCCACDA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU NH:i:1
GTGGGGAGGGAAGAAAAAAGGAGGGGGAGAGAAAGAGAAATAAGAACCAAGTTTATTATACTGTATTCAGGGGGAAAAAATTTTCCCAAGGTCCTAACAG
ACACCTGTAATCCCAGCTACTTGGGCGGCTGAGGCAGGAGAATCACTTGAACCTGGGAGGCAGAAGTTGCAGTGAGTCAAGATTGCGCCAATGCACTCCA
ACACCTGTAATCCCAGCTACTTGGGCGGCTGAGGCAGGAGAATCACTTGAACCTGGGAGGCAGAAGTTGCAGTGAGTCAAGATTGCGCCAATGCACTCCACTGTTAGGACCTTGGGAAAATTTTTTCCCCCTGAATACAGTATAATAAACTTGGTTCTTATTTCTCTTTCTCTCCCCCTCCTTTTTTCTTCCCTCCCCAC
The results of the blat showed that the best hit to FORWARD_READ was chr1 and the best hit to REVERSE_READ was in chr6. However, these paired reads mapped concordantly to both chr1 and chr6. Tophat had reported the best alignment for a single read rather than the read pair! I suspect tophat was counting this pair as "concordantly mapped", yet reported a discordant mapping to the output file.
Is this working as intended?
Best Regards,
Hyunsung John Kim
PhD candidate
UC Santa Cruz