STAR - high number of reads mapped to multiple loci

Vempalli Fazulur

unread,

Apr 6, 2018, 3:06:18 PM4/6/18

to rna-star

Dear Alex,

Thanks for excellent tool which align rnaseq data very fast.

I am using STAR to align our dataset. After removing adapters, trimmed reads were submitted to star and I am getting good alignment rate around 95% average across all samples.

Here is the command I used:

STAR --runThreadN 32 --genomeDir STARIndex --readFilesIn test_1.fastq test_2.fastq --outFileNamePrefix sample1 --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 30000000000 --twopassMode Basic --outSAMunmapped Within --outFilterMultimapNmax 10 --outTmpDir tmpdir

Star log final out file :

Mapping speed, Million of reads per hour | 107.07

Number of input reads | 23585104

Average input read length | 276

UNIQUE READS:

Uniquely mapped reads number | 15965778

Uniquely mapped reads % | 67.69%

Average mapped length | 273.29

Number of splices: Total | 10931770

Number of splices: Annotated (sjdb) | 10870946

Number of splices: GT/AG | 10809839

Number of splices: GC/AG | 73575

Number of splices: AT/AC | 3369

Number of splices: Non-canonical | 44987

Mismatch rate per base, % | 0.43%

Deletion rate per base | 0.01%

Deletion average length | 1.51

Insertion rate per base | 0.03%

Insertion average length | 1.15

MULTI-MAPPING READS:

Number of reads mapped to multiple loci | 6558742

% of reads mapped to multiple loci | 27.81%

Number of reads mapped to too many loci | 10347

% of reads mapped to too many loci | 0.04%

UNMAPPED READS:

% of reads unmapped: too many mismatches | 0.00%

% of reads unmapped: too short | 4.25%

% of reads unmapped: other | 0.20%

Following, I ran featureCounts and only few reads (20 to 30%) were assigned to features on reverse strand (Trueseq stranded library used while sequencing by default "reverse" stranded). I observed reads were assigned to both forward and reverse strands which should not be for stranded library.

And also, I observed %GC is high (49-55%) across all samples before trimming and after trimming.

Even though all adapters were removed. We are still having over represented sequences in other pair (R2) read with "no hit". After observing these, I am afraid if data is having any contamination. Please let me know what check s do I need to perform for contamination in data.

I have checked across related posts before posting this.

Could you please suggest me how can I decrease multimapped reads so that % of uniquely mapped reads will increase.

Thanks In Advance

Fazulur Rehaman

Alexander Dobin

unread,

Apr 7, 2018, 4:46:38 PM4/7/18

to rna-star

Hi Fazulur,

I do not think you have a contamination problem, as you map ~95% of the reads to your reference genome, so contamination cannot be >5%. Your mapped length is close to the read length, and mismatch rate is low.

You may want to investigate the multimappers, as they often are indicative of incomplete depletion of rRNA.

What is the theoretical GC content in the genes of your species?

What parameters you run featureCounts with? Please post the "summary" file.

Cheers

Alex

Vempalli Fazulur

unread,

Apr 8, 2018, 6:36:27 AM4/8/18

to rna-star

Dear Alex,

Thanks a lot for your response and confirmation on no contamination.

Here is the command I used for featureCounts

featureCounts -t exon -s 0 -T 12 -g gene_id -a genes.gtf -o featurecounts.txt sample.starAligned.sortedByCoord.out.bam

Following summary file

Status sample.starAligned.sortedByCoord.out.bam

Assigned 24836463

Unassigned_Ambiguity 2009092

Unassigned_MultiMapping 11665810

Unassigned_NoFeatures 10974192

Unassigned_Unmapped 2054504

Unassigned_MappingQuality 0

Unassigned_FragmentLength 0

Unassigned_Chimera 0

Unassigned_Secondary 0

Unassigned_Nonjunction 0

Unassigned_Duplicate 0

What is the theoretical GC content in the genes of your species?

I have no idea right now how I can get this. Could you please help me?

When I run rseqc bam_stat.py on bam file with hg19 rRNA bed file, few reads mapped to rRNA.

Total records: 51540061

splitbam_rseqc.in.bam (Reads consumed by input gene list):330936

splitbam_rseqc.ex.bam (Reads not consumed by input gene list):49154621

splitbam_rseqc.junk.bam (qcfailed, unmapped reads): 2054504

Could you please suggest how can I decrease multimapping rate on bam.

Thanks & Regards

Fazulur Rehaman

Alexander Dobin

unread,

Apr 9, 2018, 4:57:08 PM4/9/18

to rna-star

Hi Fazulur,

you used -s 0 option for feaureCounts, so it was counting reads from both sense and antisense to the gene annotations.

If you want to count only reads in the sense orientations to the genes, you would need to use -s1 or -s2, depending on your library strandedness.

To estimate GC content of the genes in your species, you would need to get the FASTA sequences of the genes, and then count the ACGT bases.

There must be a tool out there that can do it, but I am not aware of one.

The multimapping rate cannot be reduced by tweaking mapping parameters - it's a property of your library rather than mapping.

What you can do it count multimappers and see where they map. featurCounts has the -M option to count all multimapping alignments, and --primary option to count each multimapping read only once. If you want to use the latter option, it's better to re-run STAR with --outMultimapperOrder Random option.

Cheers

Alex

Vempalli Fazulur

unread,

Apr 10, 2018, 2:53:32 AM4/10/18

to rna-star

Dear Alex,

Thanks a lot for detailed explanation.

And yes. I agree with you. Our data is of reverse stranded. When I ran featureCounts and around 50% got assigned to features. 20-35% were not assigned due to multimapping issues. I found rRNA contamination in data. I will remove and process again.

Other that rRNA do I need to check anything. Please suggest?

I will try to estimate GC content too.

And also I tried -M option but I heard that it is not recommended. I will try with --primary after rerunning alignment using -outMultimapperOrder option.

Thanks & Regards

Fazulur Rehaman

Alexander Dobin

unread,

Apr 16, 2018, 5:44:30 PM4/16/18

to rna-star

Hi Fazulur,

if removal of rRNA reads brings down the % of multimapping reads to below 10%, I think you won't need to worry about mulitmappers.

Cheers

Alex

Vempalli Fazulur

unread,

Apr 17, 2018, 12:36:42 AM4/17/18

to rna-star

Dear Alex,

Thanks a lot for your confirmation on rRNA.

I have removed rRNA from all samples and I observed in few samples still multimapping reads were around 15%.

Do I need to perform any other filters on data in order to decrease multimapping rate.

Please suggest me.

Thanks & Regards

Fazulur Rehaman

Alexander Dobin

unread,

Apr 19, 2018, 5:40:47 PM4/19/18

to rna-star

Hi Fazulur,

15% is on the high side, so you may want to try to figure out which genes these multimappers are coming from, by counting them with featureCounts.

Another possibility is to inetrsect them with the RepeatMasker track using bedtools.

Cheers

Alex

Vempalli Fazulur

unread,

Apr 21, 2018, 11:54:15 PM4/21/18

to rna-star

Dear Alex,

Thanks for your suggestions. I will try the same.

Thanks & Regards

Fazulur Rehaman

Reply all

Reply to author

Forward