High percentage of multi-mapping reads

Fangfang

unread,

Jul 26, 2017, 6:29:15 PM7/26/17

to rna-star

Hi All,

I am new in RNAseq data analysis. I have aligned my RNAseq data using STAR-RSEM pipeline created by others. The overall mapping rate is pretty good. However, the percentage of uniquely mapped reads is 66.2%, which is lower than expected. (I searched online, it says around 80% is good for uniquely mapping.) I also tried Tophat it seems a little bit better than STAR for the uniquely mapped reads but the overall mapping rate is lower.

Is there anyway to improve the alignment or find out the reasons for multiple alignment? Thanks in advance for your suggestions and commons.

the alignment summaries show as follows:

Number of input reads | 93020561

Average input read length | 202

UNIQUE READS:

Uniquely mapped reads number | 61590714

Uniquely mapped reads % | 66.21%

Average mapped length | 199.28

Number of splices: Total | 19079205

Number of splices: Annotated (sjdb) | 19076527

Number of splices: GT/AG | 18851745

Number of splices: GC/AG | 204647

Number of splices: AT/AC | 13788

Number of splices: Non-canonical | 9025

Mismatch rate per base, % | 0.26%

Deletion rate per base | 0.01%

Deletion average length | 1.80

Insertion rate per base | 0.01%

Insertion average length | 1.54

MULTI-MAPPING READS:

Number of reads mapped to multiple loci | 29905331

% of reads mapped to multiple loci | 32.15%

Number of reads mapped to too many loci | 576652

% of reads mapped to too many loci | 0.62%

UNMAPPED READS:

% of reads unmapped: too many mismatches | 0.05%

% of reads unmapped: too short | 0.81%

% of reads unmapped: other | 0.16%

tophat2 summary:

Left reads:

Input : 93020561

Mapped : 87667354 (94.2% of input)

of these: 15415147 (17.6%) have multiple alignments (96621 have >20)

Right reads:

Input : 93020561

Mapped : 83893284 (90.2% of input)

of these: 13389795 (16.0%) have multiple alignments (96489 have >20)

92.2% overall read mapping rate.

Aligned pairs: 82664037

of these: 13116656 (15.9%) have multiple alignments

1478928 ( 1.8%) are discordant alignments

87.3% concordant pair alignment rate.

The fasta file for human genome I used consists of the concatenation of all the fasta

files in http://hgdownload.cse.ucsc.edu/goldenpath/hg19/chromosomes/.

genome annotation is gencode.v19.transcripts.patched_contigs.gtf

FastQC shows Per base sequence quality is very good, but fail for Per base sequence content, Per sequence GC content, Sequence Duplication Levels, Overrepresented sequences, and Kmer Content.

Blast hits of top overpresented sequences are 7SL RNA. I'm not sure whether high percentage of 7SL RNA in Reads accounts for higher multi-mapping.

Here shows the overpresented sequences:

>>Overrepresented sequences fail

#Sequence Count Percentage Possible Source

GGTGGCGCGTGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGTGGGAGGA 1471815 1.5822469615077897 No Hit

CCCAGCTACTCGGGAGGCTGAGGTGGGAGGATCGCTTGAGCCCAGGAGTT 1292236 1.3891939439066594 No Hit

GGTGGCGCGTGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCTGGAGGA 1083937 1.1652660318829942 No Hit

CCCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTT 925809 0.9952735073270521 No Hit

Thanks,

Fangfang

Alexander Dobin

unread,

Jul 28, 2017, 4:19:08 PM7/28/17

to rna-star

Hi Fangfang,

the FASTA files you are using contain some haplotypes and patches that may increase the number of multimappers.

I would recommend using Gencode GTF and FASTA:

ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/gencode.v26.primary_assembly.annotation.gtf.gz

ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/GRCh38.primary_assembly.genome.fa.gz

Please post the new Log.final.out file.

If your data is "total" rRNA depleted RNA, then the most likely reason for increased multimappers is incomplete rRNA depletion.

The most highly expressed rRNA locus is on chrUn_GL000220 scaffold. You can check the number of reads per chr/scaffold with

$ samtools idxstats Aligned.sortedByCoordinate.bam

Cheers

Alex

Message has been deleted

Fangfang

unread,

Aug 4, 2017, 11:07:32 AM8/4/17

to rna-star

Hi Alex,

Thank you so much for your reponse. I have now got the alignment results by using the genome and reference files from the link you provided. I followed the options used in the previous run, however the unique mapper is even slightly less than the previous one (64.92% to

66.21%).

Here shows the log.final.out:

Started job on | Aug 02 16:56:36

Started mapping on | Aug 02 17:30:46

Finished on | Aug 03 09:52:02

Mapping speed, Million of reads per hour | 5.69

Number of input reads | 93020561

Average input read length | 202

UNIQUE READS:

Uniquely mapped reads number | 60388408

Uniquely mapped reads % | 64.92%

Average mapped length | 199.55

Number of splices: Total | 18974601

Number of splices: Annotated (sjdb) | 18972605

Number of splices: GT/AG | 18767641

Number of splices: GC/AG | 186246

Number of splices: AT/AC | 13759

Number of splices: Non-canonical | 6955

Mismatch rate per base, % | 0.26%

Deletion rate per base | 0.01%

Deletion average length | 1.86

Insertion rate per base | 0.01%

Insertion average length | 1.54

MULTI-MAPPING READS:

Number of reads mapped to multiple loci | 30079805

% of reads mapped to multiple loci | 32.34%

Number of reads mapped to too many loci | 1359383

% of reads mapped to too many loci | 1.46%

UNMAPPED READS:

% of reads unmapped: too many mismatches | 0.05%

% of reads unmapped: too short | 0.96%

% of reads unmapped: other | 0.27%

I am trying to get the number of reads which are assigned to each Chromosome, but using samtools idxstats I get the file like this, I am not sure why it doesn't show the chr but the ENSTs.

ENST00000456328.2 1657 174 0

ENST00000515242.2 1653 172 0

ENST00000518655.2 1483 138 0

ENST00000450305.2 632 4 0

And Could you also let me know that in the summary whether the unique mapping for pair-end reads includes both disconcordantly and concordantly mapping or only concordant ones?

I much appreciate your suggestion. Thank you very much.

Best,

Fangfang

Alexander Dobin

unread,

Aug 4, 2017, 12:49:12 PM8/4/17

to rna-star

Hi Fangfang,

it looks like you are mapping to the transcriptome (i.e. transcript sequences) rather than genome, that's why transcript IDs appear instead of chr.

It may also explain the high multimapping rate.

Which files are you using for the genome generation?

Please send me the Log.out file from the genome generation and mapping.

Cheers

Alex

Fangfang

unread,

Aug 5, 2017, 11:14:36 PM8/5/17

to rna-star

Hi Alex,

I found I use the 'Aligned.toTranscriptome.out.bam' file for idxstats, by using the 'Aligned.out.bam' file I got the number of reads for each chr. And indeed the reads mapped to chrUn_GL000220 is pretty high.

GL000220.1 161802 41149562 0

total 360437741 mapped 4042950 unmapped

The files I used for genome generation are files you suggested

ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/gencode.v26.primary_assembly.annotation.gtf.gz

ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/GRCh38.primary_assembly.genome.fa.gz

Attached please find the Log.out files from genome generation and mapping.

Thanks,

Fangfang

star_GRCh28.Log.out

S1.Log.out

Alexander Dobin

unread,

Aug 7, 2017, 1:06:18 PM8/7/17

to rna-star

Hi Fangfang,

~40M lines mapping to GL000220.1 correspond to ~10M PE reads mapping twice to rRNA loci. There are other rRNA loci in the genome - you can check how many reads are mapping to the rRNA track from RepeatMasker.

It looks like your sample has somewhat insufficient rRNA depletion. Still, 60M unique reads should be sufficient for most analyses.

Cheers

Alex

shweta shukla

unread,

Jun 6, 2018, 10:08:28 AM6/6/18

to rna-star

Hello

Could you pls tell what is the meaning of this number (3) representing in this nucleo

tide sequence

ATATCATTCACGATATCTTCCATT(3)
Is this is a mapping number ??

Alexander Dobin

unread,

Jun 7, 2018, 2:02:04 PM6/7/18

to rna-star

Hi Shweta,

did STAR generate this? I am not aware of such format.

Cheers

Alex

Reply all

Reply to author

Forward