STAR - high number of reads mapped to multiple loci

2,149 views
Skip to first unread message

Vempalli Fazulur

unread,
Apr 6, 2018, 3:06:18 PM4/6/18
to rna-star
Dear Alex,

Thanks for excellent tool which align rnaseq data very fast.

I am using STAR to align our dataset. After removing adapters, trimmed reads were submitted to star and I am getting good alignment rate around 95% average across all samples. 

Here is the command I used:

STAR --runThreadN 32 --genomeDir STARIndex --readFilesIn test_1.fastq test_2.fastq  --outFileNamePrefix sample1 --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 30000000000 --twopassMode Basic --outSAMunmapped Within --outFilterMultimapNmax 10 --outTmpDir tmpdir

Star log final out file :

       Mapping speed, Million of reads per hour | 107.07

                          Number of input reads | 23585104
                      Average input read length | 276
                                    UNIQUE READS:
                   Uniquely mapped reads number | 15965778
                        Uniquely mapped reads % | 67.69%
                          Average mapped length | 273.29
                       Number of splices: Total | 10931770
            Number of splices: Annotated (sjdb) | 10870946
                       Number of splices: GT/AG | 10809839
                       Number of splices: GC/AG | 73575
                       Number of splices: AT/AC | 3369
               Number of splices: Non-canonical | 44987
                      Mismatch rate per base, % | 0.43%
                         Deletion rate per base | 0.01%
                        Deletion average length | 1.51
                        Insertion rate per base | 0.03%
                       Insertion average length | 1.15
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci | 6558742
             % of reads mapped to multiple loci | 27.81%
        Number of reads mapped to too many loci | 10347
             % of reads mapped to too many loci | 0.04%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches | 0.00%
                 % of reads unmapped: too short | 4.25%
                     % of reads unmapped: other | 0.20%

Following, I ran featureCounts and only few reads (20 to 30%) were assigned to features on reverse strand (Trueseq stranded library used while sequencing by default "reverse" stranded). I observed reads were assigned  to both forward and reverse strands which should not be for stranded library.

And also, I observed %GC is high (49-55%) across all samples before trimming and after trimming.
Even though all adapters were removed. We are still having over represented sequences in other pair (R2) read with "no hit". After observing these, I am afraid if data is having any contamination. Please let me know what check s do I need to perform for contamination in data.

I have checked across related posts before posting this. 

Could you please suggest me how can I decrease multimapped reads so that % of uniquely mapped reads will increase.

Thanks In Advance
Fazulur Rehaman

Alexander Dobin

unread,
Apr 7, 2018, 4:46:38 PM4/7/18
to rna-star
Hi Fazulur,

I do not think you have a contamination problem, as you map ~95% of the reads to your reference genome, so contamination cannot be >5%. Your mapped length is close to the read length, and mismatch rate is low.
You may want to investigate the multimappers, as they often are indicative of incomplete depletion of rRNA.
What is the theoretical GC content in the genes of your species?

What parameters you run featureCounts with? Please post the "summary" file.

Cheers
Alex

Vempalli Fazulur

unread,
Apr 8, 2018, 6:36:27 AM4/8/18
to rna-star
Dear Alex,

Thanks a lot for your response and confirmation on no contamination.

Here is the command I used for featureCounts

featureCounts -t exon -s 0 -T 12 -g gene_id -a genes.gtf -o featurecounts.txt sample.starAligned.sortedByCoord.out.bam 

Following summary file

Status  sample.starAligned.sortedByCoord.out.bam
Assigned        24836463
Unassigned_Ambiguity    2009092
Unassigned_MultiMapping 11665810
Unassigned_NoFeatures   10974192
Unassigned_Unmapped     2054504
Unassigned_MappingQuality       0
Unassigned_FragmentLength       0
Unassigned_Chimera      0
Unassigned_Secondary    0
Unassigned_Nonjunction  0
Unassigned_Duplicate    0

What is the theoretical GC content in the genes of your species?
I have no idea right now how I can get this. Could you please help me?

When I run rseqc bam_stat.py on bam file with hg19 rRNA bed file, few reads mapped to rRNA. 

Total records:                                         51540061
splitbam_rseqc.in.bam (Reads consumed by input gene list):330936
splitbam_rseqc.ex.bam (Reads not consumed by input gene list):49154621
splitbam_rseqc.junk.bam (qcfailed, unmapped reads):    2054504

Could you please suggest how can I decrease multimapping rate on bam.

Thanks & Regards
Fazulur Rehaman

Alexander Dobin

unread,
Apr 9, 2018, 4:57:08 PM4/9/18
to rna-star
Hi Fazulur,

you used -s 0 option for feaureCounts, so it was counting reads from both sense and antisense to the gene annotations.
If you want to count only reads in the sense orientations to the genes, you would need to use -s1 or -s2, depending on your library strandedness.

To estimate GC content of the genes in your species, you would need to get the FASTA sequences of the genes, and then count the ACGT bases.
There must be a tool out there that can do it, but I am not aware of one.

The multimapping rate cannot be reduced by tweaking mapping parameters - it's a property of your library rather than mapping.
What you can do it count multimappers and see where they map. featurCounts has the -M option to count all multimapping alignments, and --primary option to count each multimapping read only once. If you want to use the latter option, it's better to re-run STAR with --outMultimapperOrder Random option.

Cheers
Alex

Vempalli Fazulur

unread,
Apr 10, 2018, 2:53:32 AM4/10/18
to rna-star
Dear Alex,

Thanks a lot for detailed explanation.

And yes. I agree with you. Our data is of reverse stranded. When I ran featureCounts and around 50% got assigned to features. 20-35% were not assigned due to multimapping issues. I found rRNA contamination in data. I will remove and process again.

Other that rRNA do I need to check anything. Please suggest?

I will try to estimate GC content too.

And also I tried -M option but I heard that it is not recommended. I will try with --primary after rerunning alignment using -outMultimapperOrder option.

Thanks & Regards
Fazulur Rehaman

Alexander Dobin

unread,
Apr 16, 2018, 5:44:30 PM4/16/18
to rna-star
Hi Fazulur,

if removal of rRNA reads brings down the % of multimapping reads to below 10%, I think you won't need to worry about mulitmappers.

Cheers
Alex

Vempalli Fazulur

unread,
Apr 17, 2018, 12:36:42 AM4/17/18
to rna-star
Dear Alex,

Thanks a lot for your confirmation on rRNA.

I have removed rRNA from all samples and I observed in few samples still multimapping reads were around 15%.

Do I need to perform any other filters on data in order to decrease multimapping rate.

Please suggest me.

Thanks & Regards
Fazulur Rehaman 

Alexander Dobin

unread,
Apr 19, 2018, 5:40:47 PM4/19/18
to rna-star
Hi Fazulur,

15% is on the high side, so you may want to try to figure out which genes these multimappers are coming from, by counting them with featureCounts.
Another possibility is to inetrsect them with the RepeatMasker track using bedtools.

Cheers
Alex

Vempalli Fazulur

unread,
Apr 21, 2018, 11:54:15 PM4/21/18
to rna-star
Dear Alex,

Thanks for your suggestions. I will try the same.

Thanks & Regards
Fazulur Rehaman
Reply all
Reply to author
Forward
0 new messages