Would STAR filter redundant reads for RNA-seq mapping


chensu...@gmail.com

Jan 9, 2017, 9:24:39 PM
to rna-star
Hi Alex, 

Does STAR apply any kind of redundancy filter during mapping, for example keeping at most INT copies of a duplicated read?

Best
Sujun

Alexander Dobin

Jan 10, 2017, 4:49:11 PM
to rna-star
Hi Sujun,

not sure what you mean by "duplicated" reads - are these multimappers, i.e. reads that map to multiple loci in the genome?
STAR has options to limit those.

Or are you talking about "PCR" duplicates, i.e. different reads that have exactly the same sequences?

Cheers
Alex

chensu...@gmail.com

Jan 16, 2017, 9:57:25 AM
to rna-star
Hi Alex, 

Thanks for getting back to me. Seems you didn't receive my reply. 
I mean reads with exactly the same sequences. By the way, do you think PCR is the only source of such reads? Couldn't they be real, biologically meaningful reads, given that abundantly expressed genes have multiple templates?
I was using CIRCexplorer for circular RNA detection, and TopHat and STAR can give quite different results. For the circular transcripts detected by both, STAR usually reports more junction reads than TopHat for transcripts with few junction reads, but fewer for transcripts with many junction reads. I was trying to figure out why, and wondered whether STAR applies some kind of filter on duplicated reads that lowers the junction read counts for the more abundant circular transcripts. If you have any thoughts on what might cause such differences, please let me know.

Best

Alexander Dobin

Jan 18, 2017, 3:59:15 PM
to rna-star
Hi Sujun,

> I mean reads with exactly the same sequences. By the way, do you think PCR is the only source of such reads? Couldn't they be real, biologically meaningful reads, given that abundantly expressed genes have multiple templates?

You are right, these could be true duplicate RNA fragments, especially for single-end reads. For this reason, removing them may not be a good idea for expression analysis.
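To gauge how many reads share exactly the same sequence before deciding whether removing them is sensible, one can count them directly in the FASTQ file. The following is a minimal sketch, assuming an uncompressed single-end FASTQ with four lines per record; the function names and the toy reads are made up for illustration and are not part of STAR.

```python
from collections import Counter
import io


def count_duplicate_sequences(fastq_handle):
    """Count how often each read sequence occurs.

    Assumes a plain FASTQ stream with exactly four lines per record.
    """
    counts = Counter()
    for i, line in enumerate(fastq_handle):
        if i % 4 == 1:  # second line of every record is the sequence
            counts[line.strip()] += 1
    return counts


def duplication_summary(counts):
    """Summarize total reads, distinct sequences, and extra copies."""
    total = sum(counts.values())
    distinct = len(counts)
    return {
        "total_reads": total,
        "distinct_sequences": distinct,
        "duplicate_reads": total - distinct,
    }


# Toy input (hypothetical reads): r1 and r2 have identical sequences.
example = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nTTGA\n+\nIIII\n"
)
summary = duplication_summary(count_duplicate_sequences(example))
print(summary)
```

A high duplicate count on its own does not distinguish PCR duplicates from true duplicate fragments of a highly expressed transcript, which is the point made above.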
  
> I was using CIRCexplorer for circular RNA detection, and TopHat and STAR can give quite different results. For the circular transcripts detected by both, STAR usually reports more junction reads than TopHat for transcripts with few junction reads, but fewer for transcripts with many junction reads. I was trying to figure out why, and wondered whether STAR applies some kind of filter on duplicated reads that lowers the junction read counts for the more abundant circular transcripts. If you have any thoughts on what might cause such differences, please let me know.

I am not sure how this software works - does it use STAR's "chimeric" detection, which is required to detect circular junctions?

Cheers
Alex

chensu...@gmail.com

Jan 18, 2017, 10:29:35 PM
to rna-star
Thanks Alex. 
Yes, CIRCexplorer takes the results of STAR's chimeric detection. It quantifies circular RNA abundance using the junction reads reported by STAR. That is to say, for certain "chimeric" alignments, STAR reports fewer supporting junction reads than TopHat-Fusion (usually for chimeric junctions with higher abundance).
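For a side-by-side comparison with another tool, the supporting reads per chimeric junction can be tallied from STAR's Chimeric.out.junction file. This is a rough sketch, assuming the tab-separated layout in which columns 1-6 hold donor chromosome, donor breakpoint, donor strand, acceptor chromosome, acceptor breakpoint, and acceptor strand, with one line per read; the function name and the toy rows are hypothetical, and this is not how CIRCexplorer itself counts reads.

```python
from collections import Counter
import io


def count_junction_reads(handle):
    """Count lines (reads) per chimeric junction.

    Assumes columns 1-6 are: donor chr, donor breakpoint, donor strand,
    acceptor chr, acceptor breakpoint, acceptor strand.
    """
    counts = Counter()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # blank or malformed line
        # Skip comment/header rows: breakpoints must be integers.
        if not (fields[1].isdigit() and fields[4].isdigit()):
            continue
        key = (fields[0], int(fields[1]), fields[2],
               fields[3], int(fields[4]), fields[5])
        counts[key] += 1
    return counts


# Toy rows (made-up coordinates): two reads support the first junction.
toy = io.StringIO(
    "chr1\t1000\t+\tchr1\t800\t+\t1\t0\t0\tread1\t950\t50M\t800\t50M\n"
    "chr1\t1000\t+\tchr1\t800\t+\t1\t0\t0\tread2\t960\t40M\t800\t60M\n"
    "chr2\t5000\t-\tchr2\t5300\t-\t2\t0\t0\tread3\t5000\t50M\t5250\t50M\n"
)
junction_counts = count_junction_reads(toy)
for junction, n_reads in junction_counts.most_common():
    print(junction, n_reads)
```

Running the same tally on junctions where the two aligners disagree gives a short list of read names to inspect by hand.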

Best
Sujun

Alexander Dobin

Jan 19, 2017, 5:05:15 PM
to rna-star
Hi Sujun,

STAR uses very conservative filters for chimeric (including circular) alignments.
For instance, it will only output uniquely mapping chimeras, and for PE reads, only paired alignments (no single-end chimeras).
I am not familiar with TopHat-Fusion and do not know what kind of filters it uses. 

There are many parameters that can be adjusted to increase the number of detected chimeras, as listed below.
Of course, any increase in sensitivity will be traded off against an increase in the false positive rate.

If you can make an example with a few reads with circular junction detected by TopHat-Fusion, but not STAR, I can look into it more.

Cheers
Alex


chimSegmentMin              0
    int>=0: minimum length of chimeric segment; if ==0, no chimeric output

chimScoreMin                0
    int>=0: minimum total (summed) score of the chimeric segments

chimScoreDropMax            20
    int>=0: max drop (difference) of chimeric score (the sum of scores of all chimeric segments) from the read length

chimScoreSeparation         10
    int>=0: minimum difference (separation) between the best chimeric score and the next one

chimScoreJunctionNonGTAG    -1
    int: penalty for a non-GT/AG chimeric junction

chimJunctionOverhangMin     20
    int>=0: minimum overhang for a chimeric junction

chimSegmentReadGapMax       0
    int>=0: maximum gap in the read sequence between chimeric segments