High percentage of multi-mapping reads only when aligning to mouse genome


Suraj Kannan

Jun 19, 2018, 4:23:05 PM
to rna-star
Apologies in advance, as this may not quite be a STAR problem, but I figured I would check here first. I am running STAR through a program called zUMIs, which is designed to map and count UMI-tagged single-cell RNA-seq reads. The STAR settings used by zUMIs are as follows:

STAR --genomeDir "STARidx" --runThreadN "p" --readFilesCommand zcat --sjdbGTFfile "gtf" --outFileNamePrefix "sample." --outSAMtype BAM Unsorted --outSAMmultNmax 1 --outFilterMultimapNmax 50 --outSAMunmapped Within --sjdbOverhang "readlength - 1" --twopassMode Basic --readFilesIn "cdnaread.filtered.fastq.gz"

where the genome directory is provided by the user without overhang/splice-site information. The GTF file is then provided by the user to insert junctions on the fly (as above).
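
For anyone reproducing this outside zUMIs, here is a minimal Python sketch that fills in the quoted placeholders of the command above. The function name `build_star_command` and its arguments are hypothetical. The STAR flags are copied from the command line above and nothing else is added. The `two_pass` switch only drops `--twopassMode Basic`, which makes it easy to compare against a 1-pass run.

```python
from typing import List


def build_star_command(
    genome_dir: str,
    gtf: str,
    fastq_gz: str,
    read_length: int,
    threads: int,
    prefix: str = "sample.",
    two_pass: bool = True,
) -> List[str]:
    """Return the STAR argument list from the post, with placeholders filled in.

    Hypothetical helper: only flags that appear in the quoted command are used.
    """
    cmd = [
        "STAR",
        "--genomeDir", genome_dir,
        "--runThreadN", str(threads),
        "--readFilesCommand", "zcat",
        "--sjdbGTFfile", gtf,                        # junctions inserted on the fly
        "--outFileNamePrefix", prefix,
        "--outSAMtype", "BAM", "Unsorted",
        "--outSAMmultNmax", "1",                     # write one alignment per read
        "--outFilterMultimapNmax", "50",             # allow up to 50 loci per read
        "--outSAMunmapped", "Within",
        "--sjdbOverhang", str(read_length - 1),      # "readlength - 1" in the post
        "--readFilesIn", fastq_gz,
    ]
    if two_pass:
        cmd += ["--twopassMode", "Basic"]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it; paths here are placeholders.
    print(" ".join(build_star_command("STARidx", "genes.gtf", "reads.fastq.gz", 60, 8)))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.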

I have been using this pipeline successfully for samples mapped to the human genome. Our read lengths are typically 50-66 bp, depending on the experiment. I usually see about 70-80% uniquely mapped reads and about 10% multimappers. However, for our experiments mapped to the mouse genome, I am routinely seeing 25-30% multimappers; correspondingly, downstream, after counting (with featureCounts), I am seeing a large number of pseudogenes in my count tables. In many cases, the pseudogenes appear to be related to known genes of importance (moreover, the high percentage of pseudogenes doesn't necessarily correspond to our biological context). I have tried varying --sjdbScore (changing it from 2 to 20, for example) but didn't see much difference.

Is this an issue with my reference GTF or with mapping settings? I am using Mus_musculus.GRCm38.92.gtf, downloaded from Ensembl (https://useast.ensembl.org/info/data/ftp/index.html).

Alexander Dobin

Jun 22, 2018, 4:02:41 PM
to rna-star
Hi Suraj,

typically, the % of multimappers is determined by the RNA in your library and the genome assembly, and is not affected significantly by the mapping parameters.
If you can post Log.final.out, I can check if there is anything suspicious in it.
One thing to try is 1-pass alignment (i.e. omit --twopassMode Basic); sometimes 2-pass increases the % of multimappers.
Also, you may want to take a few multimappers and BLAT them against the mouse genome in the UCSC Genome Browser, just to get a 2nd opinion on where they map.

Cheers
Alex
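
One possible way to pull a handful of multimappers out for BLAT is sketched below in Python. It reads SAM text (for example, the output of `samtools view` on the BAM that STAR wrote) and emits FASTA. The function names are hypothetical. The sketch also assumes that the `NH:i:` tag still reports the number of loci for a read even though `--outSAMmultNmax 1` limits the output to one alignment per read; that is worth confirming on a few reads before trusting the selection.

```python
from typing import Iterable, Iterator, Tuple


def multimapped_reads(
    sam_lines: Iterable[str], limit: int = 10
) -> Iterator[Tuple[str, int, str]]:
    """Yield (read name, NH value, sequence) for up to `limit` multimapped reads.

    Assumption: NH:i:<n> with n > 1 marks a read that maps to multiple loci.
    """
    seen = set()
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        flag = int(fields[1])
        if flag & 0x4 or flag & 0x100:    # skip unmapped and secondary records
            continue
        nh = 1
        for tag in fields[11:]:
            if tag.startswith("NH:i:"):
                nh = int(tag[5:])
                break
        if nh > 1 and fields[0] not in seen:
            seen.add(fields[0])
            yield fields[0], nh, fields[9]
            if len(seen) >= limit:
                return


def to_fasta(records: Iterable[Tuple[str, int, str]]) -> str:
    """Format the selected reads as FASTA, keeping NH in the header line."""
    return "".join(f">{name} NH={nh}\n{seq}\n" for name, nh, seq in records)
```

Calling `to_fasta(multimapped_reads(sys.stdin))` on piped `samtools view` output gives a small FASTA file that can be pasted into the BLAT web form.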

Suraj Kannan

Jun 25, 2018, 10:50:57 AM
to rna-...@googlegroups.com
Thanks so much for the response. I'll test your recommendations, but in the meantime, here is an example Log.final.out where we had issues:
                                 Started job on | Jun 14 21:10:05
                             Started mapping on | Jun 14 21:21:18
                                    Finished on | Jun 14 21:30:55
       Mapping speed, Million of reads per hour | 813.41

                          Number of input reads | 130371712
                      Average input read length | 60
                                    UNIQUE READS:
                   Uniquely mapped reads number | 77430509
                        Uniquely mapped reads % | 59.39%
                          Average mapped length | 59.18
                       Number of splices: Total | 5823745
            Number of splices: Annotated (sjdb) | 5776569
                       Number of splices: GT/AG | 5678374
                       Number of splices: GC/AG | 22112
                       Number of splices: AT/AC | 5403
               Number of splices: Non-canonical | 117856
                      Mismatch rate per base, % | 1.12%
                         Deletion rate per base | 0.03%
                        Deletion average length | 1.42
                        Insertion rate per base | 0.03%
                       Insertion average length | 1.27
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci | 38461826
             % of reads mapped to multiple loci | 29.50%
        Number of reads mapped to too many loci | 456
             % of reads mapped to too many loci | 0.00%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches | 0.00%
                 % of reads unmapped: too short | 10.90%
                     % of reads unmapped: other | 0.21%
                                  CHIMERIC READS:
                       Number of chimeric reads | 0
                            % of chimeric reads | 0.00%

For comparison, here is what our mapping to the human reference looks like.

                                 Started job on | Mar 19 00:41:17
                             Started mapping on | Mar 19 01:03:00
                                    Finished on | Mar 19 01:40:06
       Mapping speed, Million of reads per hour | 961.75

                          Number of input reads | 594681159
                      Average input read length | 66
                                    UNIQUE READS:
                   Uniquely mapped reads number | 459120957
                        Uniquely mapped reads % | 77.20%
                          Average mapped length | 65.53
                       Number of splices: Total | 128971189
            Number of splices: Annotated (sjdb) | 128712920
                       Number of splices: GT/AG | 127927630
                       Number of splices: GC/AG | 610176
                       Number of splices: AT/AC | 37485
               Number of splices: Non-canonical | 395898
                      Mismatch rate per base, % | 0.96%
                         Deletion rate per base | 0.01%
                        Deletion average length | 1.28
                        Insertion rate per base | 0.00%
                       Insertion average length | 1.19
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci | 97235568
             % of reads mapped to multiple loci | 16.35%
        Number of reads mapped to too many loci | 1127
             % of reads mapped to too many loci | 0.00%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches | 0.00%
                 % of reads unmapped: too short | 4.22%
                     % of reads unmapped: other | 2.23%
                                  CHIMERIC READS:
                       Number of chimeric reads | 0
                            % of chimeric reads | 0.00%

The difference in multi-mappers is about 13 percentage points (29.50% vs. 16.35%), and it seems to significantly change the percentage of pseudogenes called later. I figured it likely had to do with the genome assembly, but I'm not sure why the mouse and human genomes (from the same source) would differ.
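
To compare the two logs side by side without reading them by eye, a small Python sketch along these lines may help. The function names are hypothetical, and the parser simply assumes the `key | value` layout seen in the Log.final.out files above. The embedded excerpts are copied from those two logs.

```python
from typing import Dict

# Excerpts copied from the two Log.final.out files quoted above.
MOUSE_LOG = """\
                        Uniquely mapped reads % | 59.39%
             % of reads mapped to multiple loci | 29.50%
                 % of reads unmapped: too short | 10.90%
"""
HUMAN_LOG = """\
                        Uniquely mapped reads % | 77.20%
             % of reads mapped to multiple loci | 16.35%
                 % of reads unmapped: too short | 4.22%
"""


def parse_log_final_out(text: str) -> Dict[str, str]:
    """Split each 'key | value' line into a dictionary entry."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:               # section titles and blank lines
            continue
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def percent_differences(a: Dict[str, str], b: Dict[str, str]) -> Dict[str, float]:
    """Return a - b, in percentage points, for every shared '%' field."""
    return {
        key: float(a[key].rstrip("%")) - float(b[key].rstrip("%"))
        for key in a
        if key in b and a[key].endswith("%") and b[key].endswith("%")
    }


if __name__ == "__main__":
    diffs = percent_differences(
        parse_log_final_out(MOUSE_LOG), parse_log_final_out(HUMAN_LOG)
    )
    for key, delta in diffs.items():
        print(f"{key}: {delta:+.2f} percentage points (mouse - human)")
```

On these excerpts the mouse run has about 13 percentage points more multimappers and about 7 more "too short" unmapped reads than the human run.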

Alexander Dobin

Jun 28, 2018, 11:27:57 PM
to rna-star
Hi Suraj,

one thing that stands out in the Log.final.out is the low proportion of spliced reads.
Are these 3' end reads, Drop-seq like?
It's possible that the 3' ends of genes and pseudogenes have more sequence similarity in mouse than in human.

Cheers
Alex
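
The remark about the low proportion of spliced reads can be put in rough numbers. The Python sketch below uses a simple proxy that is an assumption here, not a metric STAR reports: total splices divided by uniquely mapped reads. The counts are copied from the two Log.final.out files earlier in the thread, and the function name is hypothetical.

```python
# Counts copied from the Log.final.out files posted earlier in the thread.
LOGS = {
    "mouse": {"uniquely_mapped": 77_430_509, "splices_total": 5_823_745},
    "human": {"uniquely_mapped": 459_120_957, "splices_total": 128_971_189},
}


def splices_per_unique_read(uniquely_mapped: int, splices_total: int) -> float:
    """Rough proxy for how spliced a library is (an assumption, not a STAR metric).

    A single read can contain more than one splice, so this is not exactly
    the fraction of spliced reads.
    """
    if uniquely_mapped <= 0:
        raise ValueError("uniquely_mapped must be positive")
    return splices_total / uniquely_mapped


if __name__ == "__main__":
    for species, counts in LOGS.items():
        ratio = splices_per_unique_read(**counts)
        print(f"{species}: {ratio:.3f} splices per uniquely mapped read")
```

By this proxy the mouse run has roughly 0.08 splices per uniquely mapped read, against roughly 0.28 for the human run.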