How to get all the reads termed to be multi mapped ?

3,289 views
Skip to first unread message

rvenkatr

unread,
May 19, 2016, 6:35:05 AM5/19/16
to rna-star
Hi Alex, 
I have trying to map 50bp reads to a mouse reference genome and my sole aim is to perform Differential gene expression analysis. The command I use is : 

STAR \

--genomeDir ${ref_genome} \

--readFilesIn reads.fastq \

--outFileNamePrefix reads \

--outFilterIntronMotifs RemoveNoncanonical \

--outSJfilterReads Unique \

--outFilterMultimapNmax 1 \

--outFilterMismatchNoverLmax 0.03 \

--outStd SAM > reads.sam


From my understanding, I am filtering out only those reads that are mapped uniquely, removing non canonical reads, allowing only 1 alignment that can be multi mapped and if there is more, it is considered as unmapped. I get the following as the result:


Number of input reads |       31515501

                      Average input read length |       50

                                    UNIQUE READS:

                   Uniquely mapped reads number |       24654387

                        Uniquely mapped reads % |       78.23%

                          Average mapped length |       50.75

                       Number of splices: Total |       3656883

            Number of splices: Annotated (sjdb) |       3639208

                       Number of splices: GT/AG |       3637727

                       Number of splices: GC/AG |       17677

                       Number of splices: AT/AC |       1479

               Number of splices: Non-canonical |       0

                      Mismatch rate per base, % |       0.17%

                         Deletion rate per base |       0.00%

                        Deletion average length |       1.49

                        Insertion rate per base |       0.00%

                       Insertion average length |       1.08

                             MULTI-MAPPING READS:

        Number of reads mapped to multiple loci |       0

             % of reads mapped to multiple loci |       0.00%

        Number of reads mapped to too many loci |       5898515

             % of reads mapped to too many loci |       18.72%

                                  UNMAPPED READS:

       % of reads unmapped: too many mismatches |       0.03%

                 % of reads unmapped: too short |       2.25%

                     % of reads unmapped: other |       0.78%


Although the number of unmapped reads is less, almost 19% of the reads go into multimapping reads? I am not going to use them while I count features using a tool like HTSeq downstream.  

My question is :  How do I get only those reads that gets mapped to different loci ? I would like to investigate which reads .


Thanks, 

RV


Alexander Dobin

unread,
May 20, 2016, 3:39:38 PM5/20/16
to rna-star
Hi RV,

the multimapping reads (with your parameters, they are counted as "reads mapped to too many loci") can be output along with other unmapped reads using --outReadsUnmapped Fastx option.
However, you won't be able to tell which reads are multimappers, and which cannot be mapped at all.
If you wan to get just the multimappers, the best way is to allow them (i.e. use --outFilterMultimapNmax 100), and then filter them from the sam file by the MAPQ field: the unique mappers will have MAPQ=255, while multimappers will have MAPQ<255. Also, unique mappers have NH:i:1 attribute, while multimappers have NH:i:<Nmult>, where Nmult is the number of loci they map to.

Cheers
Alex

rvenkatr

unread,
Jun 14, 2016, 5:49:16 PM6/14/16
to rna-star
Hi Alex, 
Thank you for the response. Yes, I filtered the multi mappers successfully. I was also wondering if you can point me towards the best way of to identify these and annotate ? I just want to know the composition of these multi mapping reads. I tried using HT-Seq and I see there is no count for any feature. If this question is irrelevant here, kindly apologize.

Thanks a lot, 
RV

Alexander Dobin

unread,
Jun 15, 2016, 12:59:12 PM6/15/16
to rna-star
Hi RV,

I think you can trick htseq into counting the multimapping reads by replacing the HI:i:<X> tag in the SAM file with HI:i:1.
Of course, the same read will be counted towards many genes.
Also, I have heard that featureCounts can count multimappers.
An even more general option is to use bedtools intersect against all the annotated exons, and then aggregate them by gene.

Cheers
Alex

PDK

unread,
Jul 11, 2017, 3:11:59 PM7/11/17
to rna-star
Hi RV,
Have you got an answer to your question on "the best way of to identify the multimaps and annotate them?" I am interested on that as well.
PDK

Alexander Dobin

unread,
Jul 12, 2017, 1:12:35 PM7/12/17
to rna-star
Hi @PDK,

bedtools intersect tool can "annotate" the alignments in the BAM file. It will basically report intersection between the BAM alignments and GTF features. However, you would need to write a script to "collapse" these annotations.

Cheers
Alex
Reply all
Reply to author
Forward
0 new messages