Alignment stats (first STAR run)

Emily Blake

Oct 24, 2016, 12:21:30 PM

to rna-star

This is my first STAR (version 2.5.2b) run, and I have a few questions about the statistics in the Log.final.out file. I'm using HOMER's map-star.pl script to map 100bp PE reads.

$ STAR --genomeLoad LoadAndKeep --outReadsUnmapped Fastx --genomeDir /path/to/genome/star-indexes/hg38-starIndex --runThreadN 72 --readFilesIn SM01_R1_merged_val_1.fq SM01_R2_merged_val_2.fq --outFileNamePrefix SM01_R1_merged_val_1.fq.hg38-starIndex.

Log.final.out output:

Started job on | Oct 19 15:13:30
Started mapping on | Oct 19 15:15:14
Finished on | Oct 19 15:20:37
Mapping speed, Million of reads per hour | 309.93

Number of input reads | 27807734
Average input read length | 194
UNIQUE READS:
Uniquely mapped reads number | 20450138
Uniquely mapped reads % | 73.54%
Average mapped length | 192.66
Number of splices: Total | 9387626
Number of splices: Annotated (sjdb) | 9106647
Number of splices: GT/AG | 9247743
Number of splices: GC/AG | 76354
Number of splices: AT/AC | 7323
Number of splices: Non-canonical | 56206
Mismatch rate per base, % | 0.18%
Deletion rate per base | 0.01%
Deletion average length | 1.47
Insertion rate per base | 0.01%
Insertion average length | 1.36
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 5996936
% of reads mapped to multiple loci | 21.57%
Number of reads mapped to too many loci | 61402
% of reads mapped to too many loci | 0.22%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 4.63%
% of reads unmapped: other | 0.04%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%

Based on the STAR documentation guide found here: BioCloud RNA-Seq (STAR) Result Documentation, I have the following questions:

Is 73.54% uniquely mapped reads OK?
The mismatch rate per base is 0.18%. If a good library typically falls between 0.5% and 0.8%, does that mean this library is of very high quality?
The % of reads mapped to multiple loci is 21.57%. As I understand it, this number is very high. What could be the issue here?
The % of reads unmapped due to too short is 4.63%; is this value decent? What constitutes a poor or good % related to sequencing quality?

Any other advice or critique of output is welcome and appreciated!

Thanks!

Alexander Dobin

Oct 25, 2016, 6:39:00 PM

to rna-star

Hi Emily,

From my experience with ENCODE data:

~5% unmapped reads is very good; it indicates high quality of both library prep and sequencing.

A 0.18% mismatch rate and a mapped length very close to the read length also indicate good sequencing quality.

~22% multi-mappers is on the high side, but not catastrophic. If your libraries are total RNA-seq (not poly(A)+), this could be caused by incomplete rRNA depletion.

I would make a wiggle track of the multi-mappers only and load it in a genome browser to see which regions of the genome they cover.
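
One way to get a multi-mappers-only subset for such a track is to filter on the NH attribute, which STAR writes by default and which gives the number of loci a read aligns to. Below is a minimal sketch in plain Python, assuming uncompressed SAM text as input. The function names are made up for illustration and are not part of STAR or any other tool.

```python
def nh_value(sam_line):
    """Return the NH:i value of one SAM record, or None if the tag is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional tags start at the 12th column of a SAM record.
    for tag in fields[11:]:
        if tag.startswith("NH:i:"):
            return int(tag[5:])
    return None


def multimappers_only(sam_lines):
    """Yield header lines plus every alignment record with NH > 1."""
    for line in sam_lines:
        if line.startswith("@"):
            # Keep the header so the output is still a valid SAM file.
            yield line
            continue
        nh = nh_value(line)
        if nh is not None and nh > 1:
            yield line
```

Passing the lines of a SAM file through `multimappers_only` and writing the result back out leaves only alignments with NH > 1, which can then be turned into a coverage track with whatever tool you normally use.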

The remaining ~73% of uniquely mapped reads is OK in the sense that you did not lose too many reads to the unmapped and multi-mapping categories.
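
The accounting behind these numbers can be checked directly from Log.final.out: every input read ends up in exactly one of the percentage categories, so they should add up to 100%. Here is a small sketch, assuming the `key | value` layout shown in the original post. The helper names are hypothetical, and the example text is just a subset of the posted log. Note that for paired-end runs STAR reports read lengths summed over both mates.

```python
# Subset of the Log.final.out lines posted in this thread (values copied as-is).
EXAMPLE_LOG = """
Average input read length | 194
Uniquely mapped reads % | 73.54%
Average mapped length | 192.66
% of reads mapped to multiple loci | 21.57%
% of reads mapped to too many loci | 0.22%
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 4.63%
% of reads unmapped: other | 0.04%
"""

# The percentage lines that together describe the fate of every input read.
FATE_KEYS = [
    "Uniquely mapped reads %",
    "% of reads mapped to multiple loci",
    "% of reads mapped to too many loci",
    "% of reads unmapped: too many mismatches",
    "% of reads unmapped: too short",
    "% of reads unmapped: other",
]


def parse_star_log(text):
    """Turn 'key | value' lines into a dict of stripped strings."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as 'UNIQUE READS:' have no value
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def summarize(stats):
    """Collect the read-fate percentages and the mapped/input length ratio."""
    fates = {key: float(stats[key].rstrip("%")) for key in FATE_KEYS}
    mapped_len = float(stats["Average mapped length"])
    read_len = float(stats["Average input read length"])
    return {
        "fates": fates,
        "total_pct": sum(fates.values()),
        "mapped_length_pct": 100.0 * mapped_len / read_len,
    }


summary = summarize(parse_star_log(EXAMPLE_LOG))
for key, value in summary["fates"].items():
    print(f"{value:6.2f}  {key}")
print(f"{summary['total_pct']:6.2f}  total")
print(f"mapped length / input length: {summary['mapped_length_pct']:.1f}%")
```

With the numbers from this run the six categories add up to 100%, and the average mapped length is about 99.3% of the average input length (192.66 / 194).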

Cheers

Alex

Emily Blake

Oct 26, 2016, 11:24:39 AM

to rna-star

Thanks, Alex! I will look into the multi-mapped reads. Do you think using FastQ Screen to check for rRNA contamination could provide some insight? I'm thinking I should add this check to my QC pipeline prior to mapping.

Best,
Emily

Alexander Dobin

Oct 28, 2016, 11:58:43 AM

to rna-star

Hi Emily,

I have not used this program, but at first sight it looks good.

If you include the scaffolds when generating the genome index (strongly recommended), you should be able to detect most rRNA reads, as the most highly expressed rRNA locus resides on one of the scaffolds. You can also use bedtools to calculate the coverage of the rRNA loci from the BAM files.
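
To illustrate the idea of measuring how much of the data falls on rRNA loci, here is a rough stand-in for the bedtools step, written in plain Python. It is only a sketch: it assumes SAM text input and a BED file of rRNA intervals that you supply yourself, and all function names are hypothetical. It counts alignment records rather than reads (multi-mappers and the two mates of a pair are counted separately), and it treats the whole reference span of a spliced alignment, introns included, as covered.

```python
import re
from collections import defaultdict

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def reference_end(pos, cigar):
    """Return the 1-based inclusive end of an alignment starting at pos."""
    span = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    return pos + span - 1


def load_bed(bed_lines):
    """Read BED intervals (0-based, half-open) into 1-based inclusive tuples."""
    intervals = defaultdict(list)
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        intervals[chrom].append((int(start) + 1, int(end)))
    return intervals


def fraction_overlapping(sam_lines, intervals):
    """Return the fraction of mapped alignment records overlapping any interval."""
    total = hits = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.split("\t")
        flag, chrom, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
        if flag & 0x4 or cigar == "*":
            continue  # skip unmapped records
        total += 1
        end = reference_end(pos, cigar)
        if any(start <= end and pos <= stop for start, stop in intervals.get(chrom, ())):
            hits += 1
    return hits / total if total else 0.0
```

Running `fraction_overlapping` once on all alignments and once on the multi-mappers only would show whether rRNA explains the high multi-mapping rate. For real BAM files, bedtools will be much faster than this linear scan.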