Would STAR filter redundant reads for RNA-seq mapping

chensu...@gmail.com

Jan 9, 2017, 9:24:39 PM

to rna-star

Hi Alex,

Does STAR apply any kind of redundancy filter during mapping, for example keeping at most INT copies of duplicated reads?

Best

Sujun

Alexander Dobin

Jan 10, 2017, 4:49:11 PM

to rna-star

Hi Sujun,

not sure what you mean by "duplicated" reads - are these multimappers, i.e. reads that map to multiple loci in the genome?

STAR has options to limit those.

Or are you talking about "PCR" duplicates, i.e. different reads that have exactly the same sequences?

Cheers

Alex

chensu...@gmail.com

Jan 16, 2017, 9:57:25 AM

to rna-star

Hi Alex,

Thanks for getting back to me. Seems you didn't receive my reply.

I mean reads with exactly the same sequences. By the way, do you think PCR is the only source of such reads? Couldn't they be real, biologically meaningful reads, given that abundantly expressed genes have multiple templates?

I was using CIRCexplorer for circular RNA detection, and TopHat and STAR can lead to quite different results. For the circular transcripts found by both, STAR usually reports more junction reads than TopHat for transcripts with low junction-read counts, but fewer for those with high counts. I was trying to figure out why, and wondering whether STAR applies some kind of filter on duplicated reads that lowers the junction-read counts for more abundant circular transcripts. If you have any thoughts on what might cause such differences, please let me know.

Best

Alexander Dobin

Jan 18, 2017, 3:59:15 PM

to rna-star

Hi Sujun,

> I mean reads with exactly the same sequences. By the way, do you think PCR is the only source for such reads -- couldn't it be real biological meaningful reads considering the fact that there are multiple templates for abundantly expressed genes?

You are right, these could be true duplicate RNA fragments, especially for single-end reads. For this reason, removing them may not be a good idea for expression analysis.
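To see how many reads share exactly the same sequence before deciding what to do with them, something like the following could be run on the input. It is only a minimal sketch: it assumes plain 4-line FASTQ records, and the function name and the toy reads are made up for illustration. It counts identical sequences and removes nothing.

```python
from collections import Counter


def count_identical_sequences(fastq_lines):
    """Count how often each read sequence occurs in 4-line FASTQ records."""
    # In a 4-line FASTQ record, the sequence is the second line (index 1).
    sequences = [
        line.strip() for i, line in enumerate(fastq_lines) if i % 4 == 1
    ]
    return Counter(sequences)


# Toy input (hypothetical reads): r1 and r2 have exactly the same sequence.
fastq = """@r1
ACGTACGT
+
IIIIIIII
@r2
ACGTACGT
+
IIIIIIII
@r3
TTTTCCCC
+
IIIIIIII
""".splitlines()

counts = count_identical_sequences(fastq)
duplicated = {seq: n for seq, n in counts.items() if n > 1}
print(duplicated)
```

Note that identical sequence alone does not tell you whether the copies come from PCR or from genuinely distinct RNA fragments, which is the caveat raised above.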

> I was using circexplore.py in circular RNA detection and tophat/star can lead to quite different results. For the overlapping circular transcripts, star usually gives higher number for transcripts with lower junction reads but lower for those with higher junction reads compared to tophat. I was trying to figure out why and wondering whether it is star having some kind of filter on duplicated reads which results in lower junction reads for more abundant circular transcripts. If you have any thoughts on what might raise such differences, please let me know.

I am not sure how this software works - does it use STAR's "chimeric" detection, which is required to detect circular junctions?

Cheers

Alex

chensu...@gmail.com

Jan 18, 2017, 10:29:35 PM

to rna-star

Thanks Alex.

Yes, CIRCexplorer takes its input from STAR's chimeric detection and quantifies circular RNA abundance from the junction reads STAR reports. That is to say, for certain chimeric alignments, STAR reports fewer supporting junction reads than TopHat-Fusion (usually for the more abundant chimeric junctions).
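For comparing per-junction support between aligners, the supporting reads can be tallied directly from STAR's chimeric junction output. This is a rough sketch, not how CIRCexplorer itself counts: it assumes a tab-separated Chimeric.out.junction file with one line per chimeric read, where the first six columns (donor chromosome, position, strand; acceptor chromosome, position, strand) identify the junction. The function name and the example lines are hypothetical.

```python
from collections import Counter


def count_junction_reads(lines):
    """Tally chimeric reads per junction from Chimeric.out.junction-style lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # skip blank or truncated lines
        try:
            int(fields[1])
            int(fields[4])
        except ValueError:
            continue  # skip header or comment lines with non-numeric positions
        # Assumed layout: columns 1-6 describe the junction, one line per read.
        counts[tuple(fields[:6])] += 1
    return counts


# Hypothetical example lines: two reads support the first junction, one the second.
example = [
    "chr1\t1000\t+\tchr1\t500\t+\t1\t0\t0\tread1\t900\t50M50S\t500\t50S50M",
    "chr1\t1000\t+\tchr1\t500\t+\t1\t0\t0\tread2\t910\t40M60S\t500\t40S60M",
    "chr2\t7000\t-\tchr2\t9000\t-\t2\t0\t0\tread3\t7000\t30M70S\t8930\t30S70M",
]

junction_counts = count_junction_reads(example)
for junction, n in junction_counts.most_common():
    print(junction, n)
```

The same tally on the TopHat-Fusion side would need that tool's own output format, which is not covered here.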

Best

Sujun

Alexander Dobin

Jan 19, 2017, 5:05:15 PM

to rna-star

Hi Sujun,

STAR uses very conservative filters for chimeric (including circular) alignments.

For instance, it will only output uniquely mapping chimeras, and for PE reads, only paired alignments (no single-end chimeras).

I am not familiar with TopHat-Fusion and do not know what kind of filters it uses.

There are many parameters that will increase the number of detected chimeras, as listed below.

Of course, any increase in sensitivity will be traded off against an increase in the false positive rate.

If you can make an example with a few reads with circular junction detected by TopHat-Fusion, but not STAR, I can look into it more.

Cheers

Alex

chimSegmentMin 0
    int>=0: minimum length of a chimeric segment; if ==0, no chimeric output

chimScoreMin 0
    int>=0: minimum total (summed) score of the chimeric segments

chimScoreDropMax 20
    int>=0: max drop (difference) of chimeric score (the sum of scores of all chimeric segments) from the read length

chimScoreSeparation 10
    int>=0: minimum difference (separation) between the best chimeric score and the next one

chimScoreJunctionNonGTAG -1
    int: penalty for a non-GT/AG chimeric junction

chimJunctionOverhangMin 20
    int>=0: minimum overhang for a chimeric junction

chimSegmentReadGapMax 0
    int>=0: maximum gap in the read sequence between chimeric segments
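
As an illustration of how the parameters listed above could be passed on the command line, here is a small sketch that assembles a STAR invocation with chimeric detection switched on. The helper name, file paths, and the chosen values are assumptions for illustration only: chimSegmentMin is set to 20 simply because 0 disables chimeric output, and the two overrides show the direction of loosening, not recommended settings. The command is only built and printed, not executed.

```python
import shlex

# Values as quoted in the list above, except chimSegmentMin, which must be > 0
# to get any chimeric output. The value 20 is an illustrative assumption.
CHIM_PARAMS = {
    "chimSegmentMin": 20,
    "chimScoreMin": 0,
    "chimScoreDropMax": 20,
    "chimScoreSeparation": 10,
    "chimScoreJunctionNonGTAG": -1,
    "chimJunctionOverhangMin": 20,
    "chimSegmentReadGapMax": 0,
}


def build_star_chimeric_command(genome_dir, read_files, out_prefix, overrides=None):
    """Return a STAR argument list with the chimeric parameters appended."""
    params = dict(CHIM_PARAMS)
    params.update(overrides or {})
    cmd = [
        "STAR",
        "--genomeDir", genome_dir,
        "--readFilesIn", *read_files,
        "--outFileNamePrefix", out_prefix,
    ]
    for name, value in params.items():
        cmd += [f"--{name}", str(value)]
    return cmd


# Hypothetical paths; the overrides relax two filters to raise sensitivity.
cmd = build_star_chimeric_command(
    "genome_index/",
    ["reads_1.fastq", "reads_2.fastq"],
    "sample_",
    overrides={"chimJunctionOverhangMin": 15, "chimScoreSeparation": 5},
)
print(shlex.join(cmd))
```

Any loosening of these values trades more detected chimeras for more false positives, as noted in the message above.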
