High percentage of multi-mapping reads only when aligning to mouse genome

Suraj Kannan

Jun 19, 2018, 4:23:05 PM

to rna-star

Apologies in advance, as this may not quite be a STAR problem, but I figured I would check here first. I am running STAR through a program called zUMIs, which is designed to map and count UMI-tagged single-cell RNA-seq reads. The STAR settings used by zUMIs are as follows:

STAR --genomeDir "STARidx" --runThreadN "p" --readFilesCommand zcat --sjdbGTFfile "gtf" --outFileNamePrefix "sample." --outSAMtype BAM Unsorted --outSAMmultNmax 1 --outFilterMultimapNmax 50 --outSAMunmapped Within --sjdbOverhang "readlength - 1" --twopassMode Basic --readFilesIn "cdnaread.filtered.fastq.gz"

where the genome directory is provided by the user without overhang/splice-site information. The GTF file is then provided by the user to insert junctions on the fly (as above).

I have been using this pipeline successfully for samples mapped to the human genome. Our read lengths are typically 50 - 66 bp, depending on the experiment. I usually see about 70 - 80% uniquely mapped reads and about 10% multimappers. However, for our experiments mapped to the mouse genome, I am routinely seeing 25 - 30% multimappers; correspondingly, downstream, after counting (with featureCounts), I am seeing a large number of pseudogenes in my counts tables. In many cases, the pseudogenes appear to be related to known genes of importance (moreover, the high percentage of pseudogenes doesn't necessarily correspond to our biological context). I have tried varying --sjdbScore (changing it from 2 to 20, for example) but didn't see much difference.

Is this an issue with my reference GTF or with mapping settings? I am using Mus_musculus.GRCm38.92.gtf, downloaded from Ensembl (https://useast.ensembl.org/info/data/ftp/index.html).
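One way to check whether the annotation itself is pseudogene-heavy is to tally gene biotypes in the GTF and run the same tally on the human GTF for comparison. The following is a minimal sketch, assuming the Ensembl convention of a gene_biotype attribute on "gene" feature lines; the function names and the three demo lines are made up for illustration and are not real annotation records.

```python
import re
from collections import Counter

# Ensembl GTFs carry the biotype as: gene_biotype "processed_pseudogene";
BIOTYPE_RE = re.compile(r'gene_biotype "([^"]+)"')

def biotype_counts(gtf_lines):
    """Count 'gene' features per gene_biotype (hypothetical helper)."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only count one line per gene
        match = BIOTYPE_RE.search(fields[8])
        counts[match.group(1) if match else "unknown"] += 1
    return counts

def pseudogene_fraction(counts):
    """Fraction of genes whose biotype name contains 'pseudogene'."""
    total = sum(counts.values())
    pseudo = sum(n for biotype, n in counts.items() if "pseudogene" in biotype)
    return pseudo / total if total else 0.0

# Made-up demo lines (not real annotation records).
DEMO_GTF = [
    "#!genome-build demo",
    '1\tdemo\tgene\t100\t900\t.\t+\t.\tgene_id "GENE_A"; gene_biotype "protein_coding";',
    '1\tdemo\tgene\t2000\t2500\t.\t-\t.\tgene_id "GENE_B"; gene_biotype "processed_pseudogene";',
    '2\tdemo\tgene\t50\t400\t.\t+\t.\tgene_id "GENE_C"; gene_biotype "lincRNA";',
]

demo_counts = biotype_counts(DEMO_GTF)
print(demo_counts.most_common())
print(f"pseudogene fraction: {pseudogene_fraction(demo_counts):.2f}")
```

For a real file, the lines could come from open("Mus_musculus.GRCm38.92.gtf"); a similar pseudogene fraction in the mouse and human annotations would point away from the GTF as the cause.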

Alexander Dobin

Jun 22, 2018, 4:02:41 PM

to rna-star

Hi Suraj,

typically, the % of multimappers is determined by the RNA in your library and the genome assembly, and is not affected significantly by the mapping parameters.

If you can post Log.final.out, I can check if there is anything suspicious in it.

One thing to try is 1-pass alignment (i.e. omit --twopassMode Basic); sometimes 2-pass increases the % of multimappers.
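As a rough illustration of the 1-pass variant, the sketch below takes the argument list from the command quoted at the top of the thread (placeholders such as "STARidx", "p", and "readlength - 1" are kept exactly as posted) and drops --twopassMode together with its value. The helper drop_option is hypothetical; whether zUMIs exposes a way to change its STAR arguments is not covered here.

```python
def drop_option(args, option, n_values=1):
    """Return a copy of args without `option` and the n_values items after it."""
    kept = []
    skip = 0
    for arg in args:
        if skip:
            skip -= 1  # still consuming the dropped option's values
            continue
        if arg == option:
            skip = n_values
            continue
        kept.append(arg)
    return kept

# Arguments as posted in the thread; quoted items are the poster's placeholders.
ZUMIS_ARGS = [
    "STAR",
    "--genomeDir", "STARidx",
    "--runThreadN", "p",
    "--readFilesCommand", "zcat",
    "--sjdbGTFfile", "gtf",
    "--outFileNamePrefix", "sample.",
    "--outSAMtype", "BAM", "Unsorted",
    "--outSAMmultNmax", "1",
    "--outFilterMultimapNmax", "50",
    "--outSAMunmapped", "Within",
    "--sjdbOverhang", "readlength - 1",
    "--twopassMode", "Basic",
    "--readFilesIn", "cdnaread.filtered.fastq.gz",
]

one_pass_args = drop_option(ZUMIS_ARGS, "--twopassMode")
print(" ".join(one_pass_args))
```

Comparing "% of reads mapped to multiple loci" in Log.final.out between the two runs would show whether the second pass contributes to the multimapper rate.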

Also, you may want to take a few multimappers and BLAT them against the mouse genome in the UCSC browser, just to get a 2nd opinion on where they map.
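A minimal sketch of pulling a handful of multimappers out as FASTA for pasting into BLAT is shown below. It assumes SAM-formatted text lines as input and relies on the NH:i: attribute, which STAR writes with its default SAM attributes; the assumption that NH still reports the number of loci when --outSAMmultNmax 1 is set is worth verifying on your own output. The function name and the demo records are made up for illustration.

```python
def multimapper_fasta(sam_lines, max_reads=5):
    """Return FASTA text for up to max_reads mapped reads with NH > 1."""
    records = []
    seen = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        name, flag, seq = fields[0], int(fields[1]), fields[9]
        if flag & 0x4 or name in seen:
            continue  # unmapped read, or read already emitted
        nh = 1
        for tag in fields[11:]:
            if tag.startswith("NH:i:"):
                nh = int(tag[5:])
        if nh > 1:
            seen.add(name)
            records.append(f">{name} NH={nh}\n{seq}\n")
            if len(records) >= max_reads:
                break
    return "".join(records)

# Made-up records for illustration only.
DEMO_SAM = [
    "@HD\tVN:1.4",
    "\t".join(["readA", "0", "1", "100", "3", "10M", "*", "0", "0",
               "ACGTACGTAC", "IIIIIIIIII", "NH:i:2", "HI:i:1"]),
    "\t".join(["readB", "0", "1", "200", "255", "10M", "*", "0", "0",
               "TTTTGGGGCC", "IIIIIIIIII", "NH:i:1", "HI:i:1"]),
    "\t".join(["readC", "4", "*", "0", "0", "*", "*", "0", "0",
               "GGGGGGGGGG", "IIIIIIIIII", "NH:i:0"]),
]

print(multimapper_fasta(DEMO_SAM), end="")
```

On real data the input lines could come from piping samtools view on the unsorted BAM (with the prefix above, sample.Aligned.out.bam) into a script that calls this function on sys.stdin. Reverse-strand reads are stored reverse-complemented in SAM, which should not matter for BLAT since it searches both strands.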

Cheers

Alex

Suraj Kannan

Jun 25, 2018, 10:50:57 AM

to rna-...@googlegroups.com

Thanks so much for the response. I'll test your recommendations, but in the meantime, here is an example Log.final.out where we had issues:

Started job on | Jun 14 21:10:05
Started mapping on | Jun 14 21:21:18
Finished on | Jun 14 21:30:55
Mapping speed, Million of reads per hour | 813.41

Number of input reads | 130371712
Average input read length | 60

UNIQUE READS:
Uniquely mapped reads number | 77430509
Uniquely mapped reads % | 59.39%
Average mapped length | 59.18
Number of splices: Total | 5823745
Number of splices: Annotated (sjdb) | 5776569
Number of splices: GT/AG | 5678374
Number of splices: GC/AG | 22112
Number of splices: AT/AC | 5403
Number of splices: Non-canonical | 117856
Mismatch rate per base, % | 1.12%
Deletion rate per base | 0.03%
Deletion average length | 1.42
Insertion rate per base | 0.03%
Insertion average length | 1.27

MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 38461826
% of reads mapped to multiple loci | 29.50%
Number of reads mapped to too many loci | 456
% of reads mapped to too many loci | 0.00%

UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 10.90%
% of reads unmapped: other | 0.21%

CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%

For comparison, here is what our mapping to the human reference looks like.

Started job on | Mar 19 00:41:17
Started mapping on | Mar 19 01:03:00
Finished on | Mar 19 01:40:06
Mapping speed, Million of reads per hour | 961.75

Number of input reads | 594681159
Average input read length | 66

UNIQUE READS:
Uniquely mapped reads number | 459120957
Uniquely mapped reads % | 77.20%
Average mapped length | 65.53
Number of splices: Total | 128971189
Number of splices: Annotated (sjdb) | 128712920
Number of splices: GT/AG | 127927630
Number of splices: GC/AG | 610176
Number of splices: AT/AC | 37485
Number of splices: Non-canonical | 395898
Mismatch rate per base, % | 0.96%
Deletion rate per base | 0.01%
Deletion average length | 1.28
Insertion rate per base | 0.00%
Insertion average length | 1.19

MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 97235568
% of reads mapped to multiple loci | 16.35%
Number of reads mapped to too many loci | 1127
% of reads mapped to too many loci | 0.00%

UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 4.22%
% of reads unmapped: other | 2.23%

CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%

The difference in multi-mappers is about 10-15 percentage points (29.50% vs. 16.35% in these two runs), and it seems to significantly change the percentage of pseudogenes called later. I figured it likely had to do with the genome assembly, but I'm not sure why the mouse and human genomes (from the same source) would differ.
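For comparing runs side by side, a small parser for the "name | value" lines of Log.final.out can help. The sketch below is illustrative only: the function names are made up, and the embedded excerpts contain just three fields copied from the two logs in this post (mouse first, human second).

```python
def parse_star_log(text):
    """Return {field name: value string} for every 'name | value' line."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as 'UNIQUE READS:' have no separator
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats

def percent(stats, field):
    """Convert a value such as '29.50%' to the float 29.5."""
    return float(stats[field].rstrip("%"))

def percent_differences(log_a, log_b, fields):
    """Percentage-point difference (a minus b) for each requested field."""
    a, b = parse_star_log(log_a), parse_star_log(log_b)
    return {f: round(percent(a, f) - percent(b, f), 2) for f in fields}

# Excerpts copied from the two logs in this post.
MOUSE_LOG = """\
UNIQUE READS:
Uniquely mapped reads % | 59.39%
% of reads mapped to multiple loci | 29.50%
% of reads unmapped: too short | 10.90%
"""
HUMAN_LOG = """\
UNIQUE READS:
Uniquely mapped reads % | 77.20%
% of reads mapped to multiple loci | 16.35%
% of reads unmapped: too short | 4.22%
"""
FIELDS = [
    "Uniquely mapped reads %",
    "% of reads mapped to multiple loci",
    "% of reads unmapped: too short",
]

diffs = percent_differences(MOUSE_LOG, HUMAN_LOG, FIELDS)
for field, delta in diffs.items():
    print(f"{field}: {delta:+.2f} points (mouse minus human)")
```

With full logs, the same functions could be applied to the contents of each run's Log.final.out file read from disk.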

Alexander Dobin

Jun 28, 2018, 11:27:57 PM

to rna-star

Hi Suraj,

one thing that stands out in the Log.final.out is the low proportion of spliced reads.
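To put a number on this, the splice counts can be normalized by the number of uniquely mapped reads. The sketch below uses the "Number of splices: Total" and "Uniquely mapped reads number" values from the two logs posted above; the function name is made up. Because a read can contain more than one splice, the ratio is an upper bound on the fraction of spliced reads rather than the fraction itself.

```python
def splices_per_unique_read(total_splices, uniquely_mapped_reads):
    """Average number of splices per uniquely mapped read."""
    return total_splices / uniquely_mapped_reads

# Values copied from the Log.final.out files posted in this thread.
mouse_ratio = splices_per_unique_read(5823745, 77430509)
human_ratio = splices_per_unique_read(128971189, 459120957)

print(f"mouse: {mouse_ratio:.3f} splices per uniquely mapped read")
print(f"human: {human_ratio:.3f} splices per uniquely mapped read")
print(f"human/mouse ratio: {human_ratio / mouse_ratio:.1f}")
```

This gives roughly 0.075 for the mouse run against roughly 0.28 for the human run, so the mouse library shows a several-fold lower splice rate even though both were run through the same pipeline.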

Are these 3' end reads, Drop-seq-like?

It's possible that, for some reason, the 3' ends of genes and pseudogenes have more sequence similarity in mouse than in human.