How to get all the reads termed to be multi mapped ?

rvenkatr

unread,

May 19, 2016, 6:35:05 AM5/19/16

to rna-star

Hi Alex,

I have trying to map 50bp reads to a mouse reference genome and my sole aim is to perform Differential gene expression analysis. The command I use is :

STAR \

--genomeDir ${ref_genome} \

--readFilesIn reads.fastq \

--outFileNamePrefix reads \

--outFilterIntronMotifs RemoveNoncanonical \

--outSJfilterReads Unique \

--outFilterMultimapNmax 1 \

--outFilterMismatchNoverLmax 0.03 \

--outStd SAM > reads.sam

From my understanding, I am filtering out only those reads that are mapped uniquely, removing non canonical reads, allowing only 1 alignment that can be multi mapped and if there is more, it is considered as unmapped. I get the following as the result:

Number of input reads | 31515501

Average input read length | 50

UNIQUE READS:

Uniquely mapped reads number | 24654387

Uniquely mapped reads % | 78.23%

Average mapped length | 50.75

Number of splices: Total | 3656883

Number of splices: Annotated (sjdb) | 3639208

Number of splices: GT/AG | 3637727

Number of splices: GC/AG | 17677

Number of splices: AT/AC | 1479

Number of splices: Non-canonical | 0

Mismatch rate per base, % | 0.17%

Deletion rate per base | 0.00%

Deletion average length | 1.49

Insertion rate per base | 0.00%

Insertion average length | 1.08

MULTI-MAPPING READS:

Number of reads mapped to multiple loci | 0

% of reads mapped to multiple loci | 0.00%

Number of reads mapped to too many loci | 5898515

% of reads mapped to too many loci | 18.72%

UNMAPPED READS:

% of reads unmapped: too many mismatches | 0.03%

% of reads unmapped: too short | 2.25%

% of reads unmapped: other | 0.78%

Although the number of unmapped reads is less, almost 19% of the reads go into multimapping reads? I am not going to use them while I count features using a tool like HTSeq downstream.

My question is : How do I get only those reads that gets mapped to different loci ? I would like to investigate which reads .

Thanks,

RV

Alexander Dobin

unread,

May 20, 2016, 3:39:38 PM5/20/16

to rna-star

Hi RV,

the multimapping reads (with your parameters, they are counted as "reads mapped to too many loci") can be output along with other unmapped reads using --outReadsUnmapped Fastx option.

However, you won't be able to tell which reads are multimappers, and which cannot be mapped at all.

If you wan to get just the multimappers, the best way is to allow them (i.e. use --outFilterMultimapNmax 100), and then filter them from the sam file by the MAPQ field: the unique mappers will have MAPQ=255, while multimappers will have MAPQ<255. Also, unique mappers have NH:i:1 attribute, while multimappers have NH:i:<Nmult>, where Nmult is the number of loci they map to.

Cheers

Alex

rvenkatr

unread,

Jun 14, 2016, 5:49:16 PM6/14/16

to rna-star

Hi Alex,

Thank you for the response. Yes, I filtered the multi mappers successfully. I was also wondering if you can point me towards the best way of to identify these and annotate ? I just want to know the composition of these multi mapping reads. I tried using HT-Seq and I see there is no count for any feature. If this question is irrelevant here, kindly apologize.

Thanks a lot,

RV

Alexander Dobin

unread,

Jun 15, 2016, 12:59:12 PM6/15/16

to rna-star

Hi RV,

I think you can trick htseq into counting the multimapping reads by replacing the HI:i:<X> tag in the SAM file with HI:i:1.

Of course, the same read will be counted towards many genes.

Also, I have heard that featureCounts can count multimappers.

An even more general option is to use bedtools intersect against all the annotated exons, and then aggregate them by gene.

Cheers

Alex

PDK

unread,

Jul 11, 2017, 3:11:59 PM7/11/17

to rna-star

Hi RV,

Have you got an answer to your question on "the best way of to identify the multimaps and annotate them?" I am interested on that as well.

PDK

Alexander Dobin

unread,

Jul 12, 2017, 1:12:35 PM7/12/17

to rna-star

Hi @PDK,

bedtools intersect tool can "annotate" the alignments in the BAM file. It will basically report intersection between the BAM alignments and GTF features. However, you would need to write a script to "collapse" these annotations.