Including GTF during index generation vs on-the-fly to reduce pseudogene mapping

Suraj Kannan

Jan 23, 2020, 6:36:12 PM

to rna-star

I am working with multiple 3' scRNA-seq datasets, where the cDNA read length varies from 66 bp to 150 bp. I am using STAR through a program called zUMIs, which performs STAR two-pass mapping followed by several downstream steps to count and collapse UMIs. Notably, zUMIs does not use the GTF during the index generation stage but rather provides the GTF (as well as --sjdbOverhang 'readLength' - 1) during the mapping stage to provide splice junctions on the fly - this is done so that the same index can be used for many different datasets as necessary. I find that regardless of the length of my cDNA read, my downstream data is full of pseudogenes. Specifically, I am working with a mitochondrially rich cell type, and part of my analysis actually involves quantifying these mitochondrial reads across datasets. I find that downstream, my data has a high abundance of both mitochondrial reads and mitochondrial pseudogene reads. The percentage of reads assigned to canonical mitochondrial genes vs pseudogenes is relatively consistent across cells within the same dataset but varies from dataset to dataset. There are several other pieces of evidence that make us think that it is unlikely that our cells are actually expressing mitochondrial pseudogenes, so I am trying to see if this is something that is affected by mapping parameters. Would this change significantly if we provide the GTF during the index generation stage rather than during mapping? Or are there other parameters within mapping that I should consider exploring?
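For readers following along, here is a minimal sketch of the mapping-stage setup described above, written as a Python helper that only assembles a STAR argument list. The function name and the file paths are hypothetical, and the exact command that zUMIs issues may differ; the flags themselves (--genomeDir, --readFilesIn, --sjdbGTFfile, --sjdbOverhang, --twopassMode) are standard STAR options.

```python
def star_mapping_args(genome_dir, reads_fastq, gtf_path, read_length):
    """Assemble STAR arguments that supply the annotation at mapping time.

    Hypothetical helper for illustration; it does not run STAR.
    """
    if read_length < 2:
        raise ValueError("read_length must be at least 2")
    return [
        "STAR",
        "--genomeDir", genome_dir,               # index generated without a GTF
        "--readFilesIn", reads_fastq,
        "--sjdbGTFfile", gtf_path,               # junctions inserted on the fly
        "--sjdbOverhang", str(read_length - 1),  # readLength - 1, per dataset
        "--twopassMode", "Basic",                # assumed two-pass setting
    ]


if __name__ == "__main__":
    # Example with made-up paths and a 66 bp cDNA read.
    print(" ".join(star_mapping_args("index/", "cdna.fastq", "genes.gtf", 66)))
```

Because --sjdbOverhang is computed per dataset, the same GTF-free index can be reused for the 66 bp and the 150 bp libraries.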

I am happy to provide the output log files for multiple datasets if helpful. (Also apologies for repost - I accidentally deleted the earlier post).

Alexander Dobin

Jan 25, 2020, 5:05:38 PM

to rna-star

Hi Suraj,

I think mitochondrial genes are not spliced, are they?

Hence it's unlikely GTF would have any effect on it.

Moreover, supplying GTF at the mapping stage with the ideal --sjdbOverhang = ReadLength-1 is actually marginally better.

It might be a good idea to check the Log.out and Log.final.out files if it worked properly.

I would look for another reason for seeing mitochondrial pseudogenes.

Does zUMIs count only unique mappers?

I imagine a large number of reads would map as multimappers to both M.pseudogenes and M.genes.

The multimappers should be counted.

What is the ratio of the number of M.-pseudogene reads to M.-gene reads?

If a large number of reads originate in M.genes, some proportion of them could acquire errors (RT, PCR, sequencing) and - by accident - map uniquely to M.pseudogenes instead of M.genes.
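A toy illustration of that last point, assuming a gene and a pseudogene that differ at a single base. The sequences below are made up, and mismatch counting is a stand-in for a real aligner's scoring, not STAR's algorithm.

```python
def mismatches(read, reference):
    """Count mismatching positions between two equal-length sequences."""
    if len(read) != len(reference):
        raise ValueError("read and reference must have the same length")
    return sum(1 for a, b in zip(read, reference) if a != b)


def best_targets(read, references):
    """Return the names of the references with the fewest mismatches."""
    scores = {name: mismatches(read, seq) for name, seq in references.items()}
    best = min(scores.values())
    return sorted(name for name, score in scores.items() if score == best)


# Hypothetical sequences that differ only at index 8 (A vs T).
REFERENCES = {"gene": "ACGTTGCAAC", "pseudogene": "ACGTTGCATC"}

if __name__ == "__main__":
    clean_read = "ACGTTGCAAC"   # error-free read from the gene
    error_read = "ACGTTGCATC"   # same read with an A->T error at index 8
    print(best_targets(clean_read, REFERENCES))
    print(best_targets(error_read, REFERENCES))
```

An error that happens to land on the distinguishing base makes the pseudogene the single best match, so the read looks like a unique pseudogene mapper. An error at any other position still favors the gene.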

Cheers

Alex

Suraj Kannan

Jan 25, 2020, 7:06:50 PM

to rna-star

Ah - I didn't think about splice sites in mitochondrial genes. I think you are right, they don't appear to have introns. (I was assuming that, since they are transcribed as polycistronic transcripts and then processed to individual RNAs, there might be other processing as well).

zUMIs sets --outSAMmultNmax to 1 to report only the primary alignment, and then counts primary alignments for all reads (in featureCounts), including multimappers. As you suggested, the multimappers are enriched in reads whose primary alignments are pseudogenes. Moreover, taking reads with unique alignments only produces a very high M.gene to M.pseudogene ratio (i.e. there are many M.gene unique mappers but not many M.pseudogene unique mappers). However, the problem is that, including multimappers, the ratio of M.pseudogenes to M.genes is high (this has varied from study to study but is sometimes as high as 2:1). I suppose from the mapping end, the issue is why the primary alignment is the pseudogene rather than the canonical gene, but I'm guessing that if there are multiple equally good alignments STAR selects randomly? (Also your point about possible PCR errors is fair).
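One way to reproduce this unique-versus-multimapper comparison is to tally primary alignments by their NH tag. The sketch below parses plain SAM text with the standard library only. It assumes the mitochondrial contig is named "chrM" and that NH still reports the total number of loci when only one alignment is written; both are assumptions worth checking against the actual BAM header and a few records. The "nuclear" bin contains every non-mitochondrial read, so a real analysis would restrict it to the annotated pseudogene intervals.

```python
from collections import Counter


def tally_primary_alignments(sam_lines, mito_name="chrM"):
    """Count primary alignments by location (mito/nuclear) and NH class."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete SAM record
            continue
        flag = int(fields[1])
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800).
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        nh = 1                            # treat a missing NH tag as unique
        for tag in fields[11:]:
            if tag.startswith("NH:i:"):
                nh = int(tag[5:])
                break
        location = "mito" if fields[2] == mito_name else "nuclear"
        kind = "unique" if nh == 1 else "multi"
        counts[(location, kind)] += 1
    return counts
```

Feeding it the output of `samtools view -h` on the mapped BAM gives a four-way table; a large ("nuclear", "multi") bin next to a small ("nuclear", "unique") bin would match the pattern described above.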

Alexander Dobin

Feb 3, 2020, 6:28:59 PM

to rna-star

Hi Suraj,

Does zUMIs use --outMultimapperOrder Random?

This is required if primary alignments (--outSAMmultNmax 1) are used for counting because otherwise the primary alignment is not assigned randomly and you can get a bias toward certain loci.
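To see why the ordering matters, consider a toy simulation in which every read has two equally good loci and only the first reported alignment is counted. This is a deliberately simplified model, not STAR's actual ordering logic: the locus labels are hypothetical, and the non-random case is modeled as a fixed order that happens to list the pseudogene first.

```python
import random
from collections import Counter


def pick_primary(loci, order, rng):
    """Choose the primary alignment among equally good loci (toy model)."""
    if order == "Random":
        return rng.choice(loci)
    return loci[0]  # fixed order: the same locus always comes first


def simulate(n_reads, order, seed=0):
    """Count which locus receives the primary alignment for n_reads reads."""
    rng = random.Random(seed)
    loci = ["MT-pseudogene", "MT-gene"]  # hypothetical labels
    return Counter(pick_primary(loci, order, rng) for _ in range(n_reads))


if __name__ == "__main__":
    print("fixed :", dict(simulate(10000, "Fixed")))
    print("random:", dict(simulate(10000, "Random")))
```

With a fixed order every multimapper is credited to one locus, whereas random selection splits them roughly evenly. In STAR terms, this corresponds to combining --outSAMmultNmax 1 with --outMultimapperOrder Random.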