Hello everyone,
I'm Daniele and I'm pretty new to STAR and RNA-seq in general. I'm using both TopHat2 and STAR to map reads from a paired-end, first-stranded experiment, in order to assess alignment performance and determine which tool is the most suitable for my analysis.
Before going further with my questions, I'll explain in brief what I did for this analysis.
- I've aligned my reads using default parameters in both programs, specifying strandedness (in TopHat2) accordingly;
- I've filtered all BAM files, discarding multiple alignments, discordant reads and reads whose mate is not mapped (I've kept 'not properly paired' reads since I was uncertain about the insert size);
- I've gathered a few statistics with Picard and bamtools stats (I'll report the latter since it's more straightforward).
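The filtering step above can be sketched roughly as follows. This is only an illustration of the logic, not the exact commands that were run: the function name `keep_sam_record` is hypothetical, "multiple alignments" is assumed to mean an `NH` tag greater than 1 or a secondary/supplementary flag, and "discordant" is assumed to mean mates mapped to different reference sequences. The flag bits themselves come from the SAM specification.

```python
# SAM flag bits (from the SAM specification)
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def keep_sam_record(line):
    """Return True if one SAM alignment line passes the filters sketched here.

    Hypothetical helper: the real filtering may have used different criteria.
    """
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    rname = fields[2]
    rnext = fields[6]
    # Optional tags start at column 12 and look like "NH:i:1".
    tags = {f[:2]: f[5:] for f in fields[11:]}

    if not flag & PAIRED:
        return False
    # Drop reads that are unmapped or whose mate is unmapped.
    if flag & (UNMAPPED | MATE_UNMAPPED):
        return False
    # Drop secondary and supplementary alignment records.
    if flag & (SECONDARY | SUPPLEMENTARY):
        return False
    # Assumption: a missing NH tag is treated as a single hit.
    if int(tags.get("NH", "1")) != 1:
        return False
    # Assumption: "discordant" means mates on different references.
    if rnext not in ("=", rname):
        return False
    # The proper-pair bit (0x2) is deliberately NOT checked, so
    # 'not properly paired' reads are kept.
    return True
```

Under these assumptions, a uniquely mapped read whose mate is mapped on the same reference is kept whether or not the aligner set the proper-pair bit.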
After these steps, looking at the alignment statistics, a couple of inconsistencies came up. Here's the report for the STAR reads (for a random sample):
Total reads: 45626974
Mapped reads: 45626974 (100%)
Forward strand: 22813487 (50%)
Reverse strand: 22813487 (50%)
Failed QC: 0 (0%)
Duplicates: 0 (0%)
Paired-end reads: 45626974 (100%)
'Proper-pairs': 45626974 (100%)
Both pairs mapped: 45626974 (100%)
Read 1: 22813487
Read 2: 22813487
Singletons: 0 (0%)
Average insert size (absolute value): 1176.68
Median insert size (absolute value): 173
and here are the TopHat2 statistics after filtering for the same sample:
Total reads: 40776248
Mapped reads: 40776248 (100%)
Forward strand: 20388934 (50.002%)
Reverse strand: 20387314 (49.998%)
Failed QC: 0 (0%)
Duplicates: 0 (0%)
Paired-end reads: 40776248 (100%)
'Proper-pairs': 38154534 (93.5705%)
Both pairs mapped: 40776248 (100%)
Read 1: 20388124
Read 2: 20388124
Singletons: 0 (0%)
Average insert size (absolute value): 16365.8
Median insert size (absolute value): 175
First of all, my first doubt concerns the percentage of reads mapped to the forward and reverse strands. While I expect the number to be AROUND 50% for both aligners, STAR ALWAYS gives an exact 50:50 ratio between reads mapped to the forward and reverse strands, and I find this curious (I see the same result in all my samples, no exception). Is this normal? What is STAR's strategy for aligning paired-end reads in this respect?
Moreover, another question arises when looking at the "proper pairs". Is it normal that ALL the paired-end alignments are flagged as properly paired? Are they flagged that way regardless of the insert size?
Finally, another thing I noticed is that the number of uniquely mapped reads for STAR BEFORE filtering is the same as the number of uniquely-mapped-properly-paired-concordant reads post filtering:
STAR alignment summary (raw reads, before filtering, same sample as before):
Uniquely mapped reads number | 22813487
Does the "Uniquely mapped reads number" in the STAR summary mean "all the uniquely mapped, concordant, properly paired" reads?
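A quick arithmetic check on the numbers reported above may help frame this question. It assumes, without confirmation from this post, that the STAR summary counts one "read" per pair (fragment), while bamtools counts each mate separately. The variable names are just for illustration.

```python
# Values copied from the reports above.
star_uniquely_mapped = 22813487   # STAR summary, before filtering
bam_total_reads = 45626974        # bamtools stats on STAR BAM, after filtering
bam_read1 = 22813487
bam_read2 = 22813487

# Assumption: the STAR summary counts pairs, so each unit
# corresponds to two mates in the BAM file.
mates_if_pairs = 2 * star_uniquely_mapped

print(mates_if_pairs == bam_total_reads)                 # True
print(star_uniquely_mapped == bam_read1 == bam_read2)    # True
```

If that assumption holds, the two figures agree exactly, which would mean the filtering removed nothing from the set STAR reported as uniquely mapped.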
I'm sorry if I didn't explain myself properly, and for the length of this post; it's still not easy for me to handle these topics :P
Thanks a lot for your time!
Daniele