Dear STAR community,
During a recent analysis of paired-end RNA-seq data mapped with STAR,
we learned from the sequencing centre (in the midst of the analysis)
that the dataset was highly contaminated with ribosomal RNA reads.
We verified their claim with bowtie2 and found that rRNA indeed
comprised up to 60 % of all reads in individual samples.
However, when mapped with STAR, these reads were either unmapped or
uniquely mapped to non-rRNA regions. We did several runs with
different STAR parameters, and it turned out that
--outFilterScoreMinOverLread and --outFilterMatchNminOverLread
played a big role in determining how these reads were mapped.
With the default value of 0.66 for both,
only around 50 % of reads were uniquely mapped, 6 % were multi-mapped, and
around 40 % were unmapped (too short). --outFilterMultimapNmax was
set to 20. Before we knew about the contamination, we reduced the
values of the two *OverLread parameters to 0.25, which resulted in
76 % of reads being uniquely mapped, 14 % being multi-mapped, and
only 4 % of reads remaining unmapped as too short.
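
As a rough illustration of what the two *OverLread thresholds mean: STAR normalizes them to the read length, which for paired-end reads is the sum of the mates' lengths. The sketch below uses a hypothetical 2 x 100 bp library (the read length is an assumption, not a value from our data), and STAR's exact internal rounding may differ slightly.

```python
def min_required_bases(fraction, mate_lengths):
    """Approximate minimum score / matched bases needed to pass the filter."""
    # For paired-end reads the threshold is relative to the summed mate lengths.
    return fraction * sum(mate_lengths)


def passes_filter(matched_bases, fraction, mate_lengths):
    """True if an alignment with this many matched bases would be kept."""
    return matched_bases >= min_required_bases(fraction, mate_lengths)


# Hypothetical library: 2 x 100 bp paired-end reads.
mates = (100, 100)

for fraction in (0.66, 0.25):
    print(fraction, min_required_bases(fraction, mates))

# A pair in which only 60 bases align (hypothetical partial alignment):
print(passes_filter(60, 0.66, mates), passes_filter(60, 0.25, mates))
```

Under these assumptions, a short partial alignment is rejected as "too short" at 0.66 but accepted at 0.25, which is the kind of shift between categories described above.
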
Most of the
reads that were unmapped as too short in the first run were now
mapping uniquely, which we checked by lowering
--outFilterMultimapNmax to 2. With this, the share of multi-mapped
reads dropped to 10 %, and the remaining 4 % from this category were
now reported as mapped to too many loci.
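
The way --outFilterMultimapNmax moves reads between summary categories can be sketched as below. This is a simplified model of the documented behaviour (a read aligning to more loci than the limit is treated as unmapped, "too many loci"); the function name and category labels are illustrative only.

```python
def classify(n_loci, multimap_nmax):
    """Return the summary category for a read that aligns to n_loci loci."""
    if n_loci < 1:
        raise ValueError("n_loci must be at least 1 for an aligned read")
    if n_loci == 1:
        return "unique"
    if n_loci <= multimap_nmax:
        return "multi-mapped"
    return "too many loci"


# A read with 3 candidate loci changes category when the limit drops
# from 20 to 2; a read with a single locus is unaffected.
for nmax in (20, 2):
    print(nmax, classify(1, nmax), classify(2, nmax), classify(3, nmax))
```

In this model, lowering the limit never changes the uniquely mapped fraction; it only splits the multi-mapped reads into those with at most 2 loci and those with more.
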
After learning of the contamination,
we checked how many reads were mapping to rRNA in the
STAR output. The count was surprisingly low, and a more detailed look
revealed that the rRNA reads were mapped to other regions, and not to
rRNA.
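
One way to do such a check is to count alignments that overlap annotated rRNA loci. The sketch below parses plain SAM text with the standard library only; the region coordinates and the toy records are made up for illustration (1-based, inclusive coordinates are assumed), and this is not necessarily how we counted.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance on the reference


def reference_end(pos, cigar):
    """Return the 1-based inclusive end coordinate of an alignment."""
    span = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    return pos + span - 1


def count_overlapping(sam_lines, regions):
    """Count mapped SAM records overlapping any (chrom, start, end) region."""
    hits = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, chrom, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
        if flag & 0x4 or cigar == "*":  # unmapped record
            continue
        end = reference_end(pos, cigar)
        if any(chrom == c and pos <= e and end >= s for c, s, e in regions):
            hits += 1
    return hits


# Toy data: hypothetical rRNA locus and four made-up SAM records.
rrna_regions = [("chr1", 120, 200)]
example = [
    "@HD\tVN:1.4",
    "r1\t0\tchr1\t100\t255\t50M\t*\t0\t0\t*\t*",
    "r2\t0\tchr2\t100\t255\t50M\t*\t0\t0\t*\t*",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r4\t0\tchr1\t300\t255\t50M\t*\t0\t0\t*\t*",
]
print(count_overlapping(example, rrna_regions))
```

Note that the span computed here includes spliced gaps (N operations), so a read spliced across an rRNA locus would also be counted.
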
Thus, we came to the conclusion that a) with
--outFilterScoreMinOverLread and --outFilterMatchNminOverLread set to
the default value of 0.66, the contaminant rRNA reads were not being
mapped because they were classified as too short (and not because of
multi-mapping!), and b) after lowering --outFilterScoreMinOverLread and
--outFilterMatchNminOverLread to 0.25, those reads were being mapped,
mostly uniquely, and moreover to regions other than rRNA.
Other STAR parameters set in these runs were:
--readFilesCommand zcat --outSAMstrandField intronMotif
--outReadsUnmapped Fastx --outSAMattributes All
--outFilterMultimapNmax 20 --outFilterScoreMin 1
--outFilterMatchNmin 1 --chimSegmentMin 15 --chimScoreMin 15
--chimScoreSeparation 10 --chimJunctionOverhangMin 15
--alignSJoverhangMin 8 --outFilterMismatchNmax 999
--outFilterMismatchNoverLmax 0.04
We also checked some of these for their effect on the mapping, which turned out to be negligible for the problem at hand.
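
For anyone who wants to reproduce the comparison, the two main runs differ only in the two *OverLread values. Below is a minimal sketch that assembles (but does not run) the corresponding command lines; the genome directory and FASTQ file names are placeholders, and any options not listed above (threads, output prefix, and so on) are left out.

```python
import shlex

# Parameters quoted above, shared by all runs.
COMMON_ARGS = [
    "--readFilesCommand", "zcat",
    "--outSAMstrandField", "intronMotif",
    "--outReadsUnmapped", "Fastx",
    "--outSAMattributes", "All",
    "--outFilterMultimapNmax", "20",
    "--outFilterScoreMin", "1",
    "--outFilterMatchNmin", "1",
    "--chimSegmentMin", "15",
    "--chimScoreMin", "15",
    "--chimScoreSeparation", "10",
    "--chimJunctionOverhangMin", "15",
    "--alignSJoverhangMin", "8",
    "--outFilterMismatchNmax", "999",
    "--outFilterMismatchNoverLmax", "0.04",
]


def star_command(over_lread, genome_dir="genome_index",
                 reads=("sample_R1.fastq.gz", "sample_R2.fastq.gz")):
    """Build a STAR argument list for one run without executing it."""
    return [
        "STAR",
        "--genomeDir", genome_dir,   # placeholder path
        "--readFilesIn", *reads,     # placeholder file names
        "--outFilterScoreMinOverLread", str(over_lread),
        "--outFilterMatchNminOverLread", str(over_lread),
        *COMMON_ARGS,
    ]


# Print the default-threshold run and the relaxed run side by side.
for fraction in (0.66, 0.25):
    print(shlex.join(star_command(fraction)))
```

The list returned by star_command could be passed to subprocess.run once the placeholders are replaced with real paths.
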
We would very much appreciate any help, comments, ...
Kind regards,
anze