Questions About STAR and TopHat2 different outputs

Daniele Capocefalo

unread,

Feb 9, 2016, 11:39:42 AM2/9/16

to rna-star

Hello everyone,

I'm Daniele and I'm pretty new to STAR and RNA-SEQ in general. I'm using both TopHat2 and STAR to map reads coming from a paired-end-first-stranded experiment, in order to asses alignment performances and determine which is the most suitable tools for my analysis.

Before going further with my questions, I'll explain in brief what I did for this analysis.

- I've aligned my reads using default parameters in both programs, specify strandness (in Tophah2) accordingly;
- I've filtered out all BAM files, discarding multiple alignments, discordant reads and reads with mate not mapped (I've kept 'not properly paired' reads since I was uncertain regarding the insert size);
- I've gathered a few statistics with PICARD and bamtools stats (i'll report the latter sinbce it's more straightforward)

After these steps, looking at the alignments statistics, a couple of incongruencies rose up. Here's the report for STAR reads (for a random sample):

Total reads:       45626974
Mapped reads:      45626974    (100%)
Forward strand:    22813487    (50%)
Reverse strand:    22813487    (50%)
Failed QC:         0    (0%)
Duplicates:        0    (0%)
Paired-end reads: 45626974    (100%)
'Proper-pairs':    45626974    (100%)
Both pairs mapped: 45626974    (100%)
Read 1:            22813487
Read 2:            22813487
Singletons:        0    (0%)
Average insert size (absolute value): 1176.68
Median insert size (absolute value): 173

and here's TopHat 2 statistics post filtering for the same sample:

Total reads:       40776248
Mapped reads:      40776248    (100%)
Forward strand:    20388934    (50.002%)
Reverse strand:    20387314    (49.998%)
Failed QC:         0    (0%)
Duplicates:        0    (0%)
Paired-end reads: 40776248    (100%)
'Proper-pairs':    38154534    (93.5705%)
Both pairs mapped: 40776248    (100%)
Read 1:            20388124
Read 2:            20388124
Singletons:        0    (0%)
Average insert size (absolute value): 16365.8
Median insert size (absolute value): 175

First of all, my first doubt concerns the percentage of reads that mapped on the forward and the reverse strand. While I expect the number to be AROUND 50% for both aligners, it turns out that star ALWAYS has a neat 50:50 ratio between reads mapped on the forward and reverse strand, and I find this curious (these results are reflected on all my samples, no exception). Is this normal? What is STAR strategy for aligning paired end reads on this behalf?

Moreover, another questions raises when looking at the "proper pairs". Is this normal that ALL the paired end alignments are properly paired are there flagged properly regardless of the insert size?

Finally, another thing I noticed is that the number of uniquely mapped reads for STAR BEFORE filtering is the same as the number of uniquely-mapped-properly-paired-concordant reads post filtering:
Star alignment summary (raw reads, before filtering, same sample as before):

Uniquely mapped reads number | 22813487

Does the uniquely mapped number of reads in the STAR summary means "all the uniquely mapped-concordant-properly paired" reads?

I'm sorry if I didn't explained myself properly and for the length of the subject, it's still not easy for me to handle these subjects :P

Thanks a lot for your time!

Daniele

Alexander Dobin

unread,

Feb 10, 2016, 6:50:12 PM2/10/16

to rna-star

Hi Daniele,

>>> First of all, my first doubt concerns the percentage of reads that mapped on the forward and the reverse strand. While I expect the number to be AROUND 50% for both aligners, it turns out that star ALWAYS has a neat 50:50 ratio between reads mapped on the forward and reverse strand, and I find this curious (these results are reflected on all my samples, no exception). Is this normal? What is STAR strategy for aligning paired end reads on this behalf?

Moreover, another questions raises when looking at the "proper pairs". Is this normal that ALL the paired end alignments are properly paired are there flagged properly regardless of the insert size?

By default STAR outputs only correctly paired alignments, which requires that the mates are on opposite strands - hence 22813487 reads on each strand.

Incorrectly paired alignments can be output as "chimeric" in a separate file.

>>> Finally, another thing I noticed is that the number of uniquely mapped reads for STAR BEFORE filtering is the same as the number of uniquely-mapped-properly-paired-concordant reads post filtering:
Star alignment summary (raw reads, before filtering, same sample as before):

Uniquely mapped reads number | 22813487

Does the uniquely mapped number of reads in the STAR summary means "all the uniquely mapped-concordant-properly paired" reads?

Yes, that's correct - in the Log.final.out STAR considers the paired-end read as one read. The terminology is a bit murky here - the ends of the PE read can be called mates, or read1/read2.

Cheers

Alex

Daniele Capocefalo

unread,

Feb 16, 2016, 4:12:09 AM2/16/16

to rna-star

Dear Dr. Dobin,

Thanks a lot for your quick reply and your clarifications, they were very explicative indeed. Another quick questions if I may ask: how does STAR Deals with insert size computations? I have found no mention of that in the manual. Is it possible that many of the concordantly paired read may exceed the expected insert size? (according to the experimental protocol that has been used to produce reads). Also, is there a way to limit this apart from a post-alignment-filtering by size-selecting insert size? (which in my ipinion is also incorrect because I'm possibly filtering out good alignment out of the expected insert size for several biological reasons, I know).

Again, thanks a lot for your help!

Cheers,

Daniele

Alexander Dobin

unread,

Feb 16, 2016, 12:38:28 PM2/16/16

to rna-star

Hi Daniele,

>>>how does STAR Deals with insert size computations?

please check this post https://groups.google.com/d/msg/rna-star/5i6mkGcUrB4/yy0qQRRlGgAJ

>>>Is it possible that many of the concordantly paired read may exceed the expected insert size?

Yes, the "genomic" insert size (i.e. genomic distance between two ends) will exceed the nominal insert size for all spliced reads.

>>>Also, is there a way to limit this apart from a post-alignment-filtering by size-selecting insert size?

Even if you subtract the splice junctions gaps in each mate, you cannot know whether there is a junction (annotated or novel) in the unsequenced portion of the insert between two reads.

In principle, you can transform the genomice coordinates into transcriptomic (with --quantMode TranscriptomeSAM), and check the insert size for the alignment to the transcript(s).

This "transcriptomic" insert size should follow the distribution of cDNA fragments in your library. However, this will only work if you have a *full* collection of transcript, including all the novel ones.