Hi All,
I am new in RNAseq data analysis. I have aligned my RNAseq data using STAR-RSEM pipeline created by others. The overall mapping rate is pretty good. However, the percentage of uniquely mapped reads is 66.2%, which is lower than expected. (I searched online, it says around 80% is good for uniquely mapping.) I also tried Tophat it seems a little bit better than STAR for the uniquely mapped reads but the overall mapping rate is lower.
Is there anyway to improve the alignment or find out the reasons for multiple alignment? Thanks in advance for your suggestions and commons.
the alignment summaries show as follows:
Number of input reads | 93020561
Average input read length | 202
UNIQUE READS:
Uniquely mapped reads number | 61590714
Uniquely mapped reads % | 66.21%
Average mapped length | 199.28
Number of splices: Total | 19079205
Number of splices: Annotated (sjdb) | 19076527
Number of splices: GT/AG | 18851745
Number of splices: GC/AG | 204647
Number of splices: AT/AC | 13788
Number of splices: Non-canonical | 9025
Mismatch rate per base, % | 0.26%
Deletion rate per base | 0.01%
Deletion average length | 1.80
Insertion rate per base | 0.01%
Insertion average length | 1.54
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 29905331
% of reads mapped to multiple loci | 32.15%
Number of reads mapped to too many loci | 576652
% of reads mapped to too many loci | 0.62%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.05%
% of reads unmapped: too short | 0.81%
% of reads unmapped: other | 0.16%
tophat2 summary:
Left reads:
Input : 93020561
Mapped : 87667354 (94.2% of input)
of these: 15415147 (17.6%) have multiple alignments (96621 have >20)
Right reads:
Input : 93020561
Mapped : 83893284 (90.2% of input)
of these: 13389795 (16.0%) have multiple alignments (96489 have >20)
92.2% overall read mapping rate.
Aligned pairs: 82664037
of these: 13116656 (15.9%) have multiple alignments
1478928 ( 1.8%) are discordant alignments
87.3% concordant pair alignment rate.
The fasta file for human genome I used consists of the concatenation of all the fasta
files in http://hgdownload.cse.ucsc.edu/goldenpath/hg19/chromosomes/.
genome annotation is gencode.v19.transcripts.patched_contigs.gtf
FastQC shows Per base sequence quality is very good, but fail for Per base sequence content, Per sequence GC content, Sequence Duplication Levels, Overrepresented sequences, and Kmer Content.
Blast hits of top overpresented sequences are 7SL RNA. I'm not sure whether high percentage of 7SL RNA in Reads accounts for higher multi-mapping.
Here shows the overpresented sequences:
>>Overrepresented sequences fail
#Sequence Count Percentage Possible Source
GGTGGCGCGTGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGTGGGAGGA 1471815 1.5822469615077897 No Hit
CCCAGCTACTCGGGAGGCTGAGGTGGGAGGATCGCTTGAGCCCAGGAGTT 1292236 1.3891939439066594 No Hit
GGTGGCGCGTGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCTGGAGGA 1083937 1.1652660318829942 No Hit
CCCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTT 925809 0.9952735073270521 No Hit
Thanks,
Fangfang
Hi Alex,
Thank you so much for your reponse. I have now got the alignment results by using the genome and reference files from the link you provided. I followed the options used in the previous run, however the unique mapper is even slightly less than the previous one (64.92% to
66.21%).
Here shows the log.final.out:
Started job on | Aug 02 16:56:36
Started mapping on | Aug 02 17:30:46
Finished on | Aug 03 09:52:02
Mapping speed, Million of reads per hour | 5.69
Number of input reads | 93020561
Average input read length | 202
UNIQUE READS:
Uniquely mapped reads number | 60388408
Uniquely mapped reads % | 64.92%
Average mapped length | 199.55
Number of splices: Total | 18974601
Number of splices: Annotated (sjdb) | 18972605
Number of splices: GT/AG | 18767641
Number of splices: GC/AG | 186246
Number of splices: AT/AC | 13759
Number of splices: Non-canonical | 6955
Mismatch rate per base, % | 0.26%
Deletion rate per base | 0.01%
Deletion average length | 1.86
Insertion rate per base | 0.01%
Insertion average length | 1.54
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 30079805
% of reads mapped to multiple loci | 32.34%
Number of reads mapped to too many loci | 1359383
% of reads mapped to too many loci | 1.46%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.05%
% of reads unmapped: too short | 0.96%
% of reads unmapped: other | 0.27%
I am trying to get the number of reads which are assigned to each Chromosome, but using samtools idxstats I get the file like this, I am not sure why it doesn't show the chr but the ENSTs.
ENST00000456328.2 1657 174 0
ENST00000515242.2 1653 172 0
ENST00000518655.2 1483 138 0
ENST00000450305.2 632 4 0
And Could you also let me know that in the summary whether the unique mapping for pair-end reads includes both disconcordantly and concordantly mapping or only concordant ones?
I much appreciate your suggestion. Thank you very much.
Best,
Fangfang
GL000220.1 161802 41149562 0
total 360437741 mapped 4042950 unmapped
ATATCATTCACGATATCTTCCATT(3) Is this is a mapping number ?? |