problem with ribosomal RNA mapping

Anze Bozic

unread,

Aug 21, 2014, 10:54:57 AM8/21/14

to rna-...@googlegroups.com

Dear STAR community,

during a recent analysis of paired-end rna-seq data where the mapping was done with STAR, we found out (in the midst of the analysis) from the sequencing centre that the dataset was highly contaminated with ribosomal RNA reads. We verified their claim with bowtie2, finding out that rRNA indeed comprised up to 60 % of all reads in separate samples.

However, when mapped with STAR, these reads were either unmapped, or uniquely mapped to non-rRNA. We did several runs changing some of the STAR parameters, and it turns out that the parameters --outFilterScoreMinOverLread and --outFilterMatchNminOverLread played a big role in determining the mapping of these reads.

When using the default value of 0.66, only around 50 % of reads were uniquely mapped, 6 % multi-mapped, and around 40 % were unmapped (too short). The --outFilterMultimapNmax was set to 20. Prior to knowing about the contamination, we reduced the values of the two *OverLread parameters to 0.25, which resulted in 76 % of reads being uniquely mapped, 14 % being multi-mapped, and only 4 % of reads remaining unmapped due to them being too short.

Most of the original reads which were unmapped in the first run due to them being too short were now mapping uniquely, which we checked by lowering the parameter --outFilterMultimapNmax to 2. With this, the number of multi-mapped reads lowered to 10 %, and the remaining 4 % in this category were now mapped to too many loci.

After realizing about the contamination, we went to check what is the amount of reads mapping to rRNA in the STAR output. The count was surprisingly low, and a more detailed look revealed that the rRNA reads were mapped to other regions, and not to rRNA.

Thus, we came to the conclusion that a) with --outFilterScoreMinOverLread and --outFilterMatchNminOverLread set to default value of 0.66, the contaminant rRNA reads were not being mapped due to being too short (and not because of multi-mapping!), and b) with lowering the value of --outFilterScoreMinOverLread and --outFilterMatchNminOverLread to 0.25, those reads were now being mostly uniquely mapped, and furthermore to regions other than rRNA.

Other STAR parameters set in these runs were

--readFilesCommand zcat --outSAMstrandField intronMotif --outReadsUnmapped Fastx --outSAMattributes All --outFilterMultimapNmax 20 --outFilterScoreMin 1 --outFilterMatchNmin 1 --chimSegmentMin 15 --chimScoreMin 15 --chimScoreSeparation 10 --chimJunctionOverhangMin 15 --alignSJoverhangMin 8 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04

We also checked some of these for the effect they had on the mapping, which turned out to be negligible for our problem at hand.

We would very much appreciate any help, comments, ...

Kind regards,

anze

Alexander Dobin

unread,

Aug 22, 2014, 6:54:31 PM8/22/14

to rna-...@googlegroups.com

Hi Anze,

when you generated the genome, did you use all the scaffolds, or just the main chromosomes?

If the latter is true, the symptoms you are describing can be explained as follows.

Some highly expressed rRNA loci are included in the scaffolds, and not in the chromosomes (the notorious one is chrUn_gl000220 in hg19 notation). If these scaffolds are not included in the genome generation step, STAR will try to map them elsewhere, likely with multiple mismatches and severely trimmed, and they may become unique mappers. If you reduce -outFilterScoreMinOverLread and --outFilterMatchNminOverLread you will allow even more of this false alignments. If all the scaffolds are included, most of these false alignments will map to their proper places as multi-mappers, and you will also see a significant reduction of the unmapped reads

which will also become multi-mappers.

We have observed this effect in many samples in ENCODE RNA-seq, it's especially severe for the total RNA-seq data.

Please let me know if this suggestion solves your problem.

Cheers

Alex

Anze Bozic

unread,

Aug 25, 2014, 6:43:29 AM8/25/14

to rna-...@googlegroups.com

Hi Alex,

thanks for the quick answer! We are using the indices provided by you (ENSEMBL -- mmu), which do contain the scaffolds, no? So it seems that this is unfortunately not the issue.. We are still puzzled about the origin of it.

Cheers,

anze

Alexander Dobin

unread,

Aug 26, 2014, 12:18:12 PM8/26/14

to rna-...@googlegroups.com

Hi Anze,

did you download the genome from here:

http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARgenomes/ENSEMBL/mus_musculus/ENSEMBL.mus_musculus.release-75/
Then it contains the scaffolds I was talking about. The situation I described above was for the human genome, I was thinking it might be true for mouse as well, but it does not work the same way apparently.

It's possible that mouse scaffolds do not contain all the necessary rRNA sequences.

Which sequences did you use for bowtie mapping? Could you send me a few thousands lines of the bowtie alignments - I will try to figure out why they do not map to the genome.

Cheers

Alex

Alexander Predeus

unread,

Aug 27, 2014, 7:08:31 PM8/27/14

to rna-...@googlegroups.com

Actually I've looked into this problem in detail and if you use Gencode vM2/v19 and appropriate genome fasta (with all extra scaffolds, downloaded from Gencode website), STAR definitely maps rRNA

furthermore, estimated percentages are very close to what I was getting with Tophat

I use Picard's CollectRnaSeqMetrics on every experiment I process and check rRNA percentage as a part of routine quality control.

Anze Bozic

unread,

Aug 28, 2014, 9:17:45 AM8/28/14

to rna-...@googlegroups.com

Hi Alex,

yes, that's exactly where it was downloaded from. I'm attaching the first 10000 lines of a sample bowtie alignment to help you make sense of the problem; please let me know if you'll need anything else. I'll appreciate any help!

Thanks,

anze

bowtie_sample.sam

Anze Bozic

unread,

Aug 28, 2014, 9:49:29 AM8/28/14

to rna-...@googlegroups.com

Sorry, forgot to mention -- the sequences used for the mapping were

>gi|511668571|tpg|BK000964.3| TPA_exp: Mus musculus ribosomal DNA, complete repeating unit

Best,

anze

--
You received this message because you are subscribed to a topic in the Google Groups "rna-star" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/rna-star/luYiREFD9f8/unsubscribe.
To unsubscribe from this group and all its topics, send an email to rna-star+u...@googlegroups.com.
Visit this group at http://groups.google.com/group/rna-star.

sisterdot

unread,

Sep 29, 2015, 10:29:12 AM9/29/15

to rna-star

hey Anze, hey all,

got a similar observation- bowtie mapping to BK000964 identifies higher rRNA contamination than using STAR versus GRCm38.p4.genome.fa (from gencode, including scaffolds).

but its likely not because of rRNA contained in the scaffolds- or at least
gencode.vM6.chr_patch_hapl_scaff.annotation.gtf
indicates that rRNA is only found on chromosomes and not scaffolds.

the GRCm38/mm10 genome does not include BK000964 based on BLAT.
without testing- i am assuming the mm10 genome sequence is lacking part of the rRNA information.

greets

Reply all

Reply to author

Forward