High percentage of multi-mapping reads

3,568 views
Skip to first unread message

Fangfang

unread,
Jul 26, 2017, 6:29:15 PM7/26/17
to rna-star

Hi All,


I am new in RNAseq data analysis. I have aligned my RNAseq data using STAR-RSEM pipeline created by others. The overall mapping rate is pretty good. However, the percentage of uniquely mapped reads is 66.2%, which is lower than expected. (I searched online, it says around 80% is good for uniquely mapping.)   I also tried Tophat it seems a little bit better than STAR for the uniquely mapped reads but the overall mapping rate is lower.


Is there anyway to improve the alignment or find out the reasons for multiple alignment?  Thanks in advance for your suggestions and commons.


the alignment summaries show as follows:

                      

                          Number of input reads | 93020561

                      Average input read length | 202

                                    UNIQUE READS:

                   Uniquely mapped reads number | 61590714

                        Uniquely mapped reads % | 66.21%

                          Average mapped length | 199.28

                       Number of splices: Total | 19079205

            Number of splices: Annotated (sjdb) | 19076527

                       Number of splices: GT/AG | 18851745

                       Number of splices: GC/AG | 204647

                       Number of splices: AT/AC | 13788

               Number of splices: Non-canonical | 9025

                      Mismatch rate per base, % | 0.26%

                         Deletion rate per base | 0.01%

                        Deletion average length | 1.80

                        Insertion rate per base | 0.01%

                       Insertion average length | 1.54

                             MULTI-MAPPING READS:

        Number of reads mapped to multiple loci | 29905331

             % of reads mapped to multiple loci | 32.15%

        Number of reads mapped to too many loci | 576652

             % of reads mapped to too many loci | 0.62%

                                  UNMAPPED READS:

       % of reads unmapped: too many mismatches | 0.05%

                 % of reads unmapped: too short | 0.81%

                     % of reads unmapped: other | 0.16%



 tophat2 summary:

Left reads:

          Input     :  93020561

           Mapped   :  87667354 (94.2% of input)

            of these:  15415147 (17.6%) have multiple alignments (96621 have >20)

Right reads:

          Input     :  93020561

           Mapped   :  83893284 (90.2% of input)

            of these:  13389795 (16.0%) have multiple alignments (96489 have >20)

92.2% overall read mapping rate.


Aligned pairs:  82664037

     of these:  13116656 (15.9%) have multiple alignments

                 1478928 ( 1.8%) are discordant alignments

87.3% concordant pair alignment rate.



The fasta file for human genome I used consists of the concatenation of all the fasta

files in  http://hgdownload.cse.ucsc.edu/goldenpath/hg19/chromosomes/.

genome annotation is gencode.v19.transcripts.patched_contigs.gtf

 

FastQC shows Per base sequence quality is very good, but fail for Per base sequence content, Per sequence GC content, Sequence Duplication Levels, Overrepresented sequences, and Kmer Content. 

Blast hits of top overpresented sequences are 7SL RNA. I'm not sure whether high percentage of 7SL RNA in Reads accounts for higher multi-mapping.


Here shows the overpresented sequences:


>>Overrepresented sequences fail

#Sequence Count Percentage Possible Source

GGTGGCGCGTGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGTGGGAGGA 1471815 1.5822469615077897 No Hit

CCCAGCTACTCGGGAGGCTGAGGTGGGAGGATCGCTTGAGCCCAGGAGTT 1292236 1.3891939439066594 No Hit

GGTGGCGCGTGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCTGGAGGA 1083937 1.1652660318829942 No Hit

CCCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTT 925809 0.9952735073270521 No Hit


Thanks,


Fangfang

Alexander Dobin

unread,
Jul 28, 2017, 4:19:08 PM7/28/17
to rna-star
Hi Fangfang,

the FASTA files you are using contain some haplotypes and patches that may increase the number of multimappers.
I would recommend using Gencode GTF and FASTA:
Please post the new Log.final.out file.

If your data is "total" rRNA depleted RNA, then the most likely reason for increased multimappers is incomplete rRNA depletion.
The most highly expressed rRNA locus is on chrUn_GL000220 scaffold. You can check the number of reads per chr/scaffold with 
$ samtools idxstats Aligned.sortedByCoordinate.bam

Cheers
Alex
Message has been deleted
Message has been deleted

Fangfang

unread,
Aug 4, 2017, 11:07:32 AM8/4/17
to rna-star

Hi Alex,

Thank you so much for your reponse. I have now got the alignment results by using the genome and reference files from the link you provided.  I followed the options used in the previous run, however the unique mapper is even slightly less than the previous one (64.92% to

66.21%).


Here shows the log.final.out:

                                 Started job on | Aug 02 16:56:36

                             Started mapping on | Aug 02 17:30:46

                                    Finished on | Aug 03 09:52:02

       Mapping speed, Million of reads per hour | 5.69


                          Number of input reads | 93020561

                      Average input read length | 202

                                    UNIQUE READS:

                   Uniquely mapped reads number | 60388408

                        Uniquely mapped reads % | 64.92%

                          Average mapped length | 199.55

                       Number of splices: Total | 18974601

            Number of splices: Annotated (sjdb) | 18972605

                       Number of splices: GT/AG | 18767641

                       Number of splices: GC/AG | 186246

                       Number of splices: AT/AC | 13759

               Number of splices: Non-canonical | 6955

                      Mismatch rate per base, % | 0.26%

                         Deletion rate per base | 0.01%

                        Deletion average length | 1.86

                        Insertion rate per base | 0.01%

                       Insertion average length | 1.54

                             MULTI-MAPPING READS:

        Number of reads mapped to multiple loci | 30079805

             % of reads mapped to multiple loci | 32.34%

        Number of reads mapped to too many loci | 1359383

             % of reads mapped to too many loci | 1.46%

                                  UNMAPPED READS:

       % of reads unmapped: too many mismatches | 0.05%

                 % of reads unmapped: too short | 0.96%

                     % of reads unmapped: other | 0.27%


I am trying to get the number of reads which are assigned to each Chromosome, but using samtools idxstats I get the file like this, I am not sure why it doesn't show the chr but the ENSTs.

ENST00000456328.2 1657 174 0

ENST00000515242.2 1653 172 0

ENST00000518655.2 1483 138 0

ENST00000450305.2 632 4 0


And Could you also let me know that in the summary whether the unique mapping for pair-end reads includes both disconcordantly and concordantly mapping or only concordant ones? 

I much appreciate your suggestion. Thank you very much.


Best,


Fangfang

Alexander Dobin

unread,
Aug 4, 2017, 12:49:12 PM8/4/17
to rna-star
Hi Fangfang,

it looks like you are mapping to the transcriptome (i.e. transcript sequences) rather than genome, that's why transcript IDs appear instead of chr.
It may also explain the high multimapping rate.
Which files are you using for the genome generation?
Please send me the Log.out file from the genome generation and mapping.

Cheers
Alex

Fangfang

unread,
Aug 5, 2017, 11:14:36 PM8/5/17
to rna-star
Hi Alex,

 I found I use the 'Aligned.toTranscriptome.out.bam' file for idxstats, by using the 'Aligned.out.bam' file I got the number of reads for each chr. And indeed the reads mapped to chrUn_GL000220 is pretty high.

GL000220.1      161802  41149562        0

total 360437741 mapped 4042950 unmapped


The files I used for genome generation are files you suggested
star_GRCh28.Log.out
S1.Log.out

Alexander Dobin

unread,
Aug 7, 2017, 1:06:18 PM8/7/17
to rna-star
Hi Fangfang,

~40M lines mapping to GL000220.1 correspond to ~10M PE reads mapping twice to rRNA loci. There are other rRNA loci in the genome - you can check how many reads are mapping to the rRNA track from RepeatMasker.
It looks like your sample has somewhat insufficient rRNA depletion. Still, 60M unique reads should be sufficient for most analyses.

Cheers
Alex

shweta shukla

unread,
Jun 6, 2018, 10:08:28 AM6/6/18
to rna-star
Hello
Could you pls tell what is the meaning of this number (3) representing in this nucleo
tide sequence
ATATCATTCACGATATCTTCCATT(3)
Is this is a mapping number ??

Alexander Dobin

unread,
Jun 7, 2018, 2:02:04 PM6/7/18
to rna-star
Hi Shweta,

did STAR generate this? I am not aware of such format.

Cheers
Alex
Reply all
Reply to author
Forward
0 new messages