Quantification of non-uniquely projected transcript alignments when using Salmon in alignment-based mode

R. El-Athman

Jan 9, 2019, 8:45:45 AM

to Sailfish Users Group

Hi,

I have a question concerning the use of Salmon in alignment-based mode and the quantification of transcripts that share the same exon regions. I have used STAR (with the option --quantMode TranscriptomeSAM) to map reads to the genome and then “project” the alignments to transcriptome coordinates. When a genomic alignment corresponds to several transcriptomic coordinates, it is projected to all of them, so one genomic alignment is converted into as many transcriptomic alignments as needed. Now I was wondering about the following questions:

1) How does the Salmon alignment-based mode treat these alignments for transcript quantification? Is there a reason why all alignments for the same read should appear consecutively in the input alignment file?

2) How does this affect the summarization of transcript TPM counts to gene-level counts when using tximport with the txOut=FALSE option?

In short, does the combined use of STAR and Salmon (in alignment-based mode) lead to genes having a higher expression due to many transcripts sharing the same exon(s)/genome coordinates (and thus read alignments being projected to several transcripts) or is there a way to control for this?

Rob

Jan 9, 2019, 10:10:20 AM

to Sailfish Users Group

Hi,

Welcome to the group!

On Wednesday, January 9, 2019 at 8:45:45 AM UTC-5, R. El-Athman wrote:

Hi,

I have a question concerning the use of Salmon in alignment-based mode and the quantification of transcripts that share the same exon regions. I have used STAR (with the option --quantMode TranscriptomeSAM) to map reads to the genome and then “project” the alignments to transcriptome coordinates. When a genomic alignment corresponds to several transcriptomic coordinates, it is projected to all of them, so one genomic alignment is converted into as many transcriptomic alignments as needed. Now I was wondering about the following questions:

1) How does the Salmon alignment-based mode treat these alignments for transcript quantification? Is there a reason why all alignments for the same read should appear consecutively in the input alignment file?

Salmon will consider all of these distinct alignments for the fragment. One of the main goals of salmon is to model and resolve multimapping reads, and so this is precisely the intended use case.

When you feed salmon the output of STAR (with alignments projected to the transcriptome), salmon will consider all of the alignment positions of each read and attempt to allocate the read among them (probabilistically) in the manner that maximizes the joint likelihood of all of the observed data. When feeding alignments to salmon, it is crucial that the records for a read appear consecutively, and that the two records for the ends of a paired-end mapping are adjacent in the input file (this is the same requirement made by RSEM). This is because the SAM/BAM parser in salmon assumes that it will see all of the alignments for a single read together in a group, and that alignments for the same end of a fragment will be consecutive. Without these assumptions, parsing would become highly inefficient in terms of time and memory, since you could potentially have to hold the entire BAM file in memory before seeing all of the relevant alignments for a read.
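As a rough illustration of the grouping requirement, here is a minimal Python sketch that scans SAM text and reports read names whose records are not contiguous (as would happen, for example, after coordinate sorting). The function name `find_ungrouped_reads` and the toy records are made up for this example; this is not salmon's parser, and it does not check mate adjacency within a group.

```python
def find_ungrouped_reads(sam_lines):
    """Return read names whose records are split across non-adjacent groups.

    Hypothetical helper for illustration only. It keeps every read name
    seen so far in memory, so it is meant for spot checks, not huge files.
    """
    finished = set()   # read names whose group has already ended
    offenders = []     # read names that reappear after their group ended
    current = None     # read name of the group being scanned
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue   # skip blank lines and SAM header lines
        name = line.split("\t", 1)[0]   # QNAME is the first SAM column
        if name == current:
            continue   # still inside the same group
        if current is not None:
            finished.add(current)
        if name in finished and name not in offenders:
            offenders.append(name)
        current = name
    return offenders


# Toy records (QNAME, FLAG, RNAME, POS only; real SAM lines have more columns).
grouped = [
    "@HD\tVN:1.4",
    "r1\t99\ttxA\t100", "r1\t147\ttxA\t250",   # r1, alignment to txA
    "r1\t99\ttxB\t40", "r1\t147\ttxB\t190",    # r1, alignment to txB
    "r2\t0\ttxA\t5",
]
scattered = ["r1\t0\ttxA\t100", "r2\t0\ttxA\t5", "r1\t0\ttxB\t40"]

print(find_ungrouped_reads(grouped))    # []
print(find_ungrouped_reads(scattered))  # ['r1']
```

To run it on a real file, the lines could come from a SAM file opened with `open(path)`; a BAM file would first need to be converted to SAM text.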

2) How does this affect the summarization of transcript TPM counts to gene-level counts when using tximport with the txOut=FALSE option?

In short, does the combined use of STAR and Salmon (in alignment-based mode) lead to genes having a higher expression due to many transcripts sharing the same exon(s)/genome coordinates (and thus read alignments being projected to several transcripts) or is there a way to control for this?

This use case is fine, and is one of the intended use cases. Salmon does not "double-count" the alignments to multiple transcripts in any way, so it's not as though you will have more "read mass" after quantifying with salmon than if you were simply counting the reads. However, it will attempt to optimally allocate each read, taking into consideration all of the observed alignments. It is possible that if you quantify with salmon rather than with a simple counting-based approach, you may find higher abundance for some genes. One primary mechanism for this is that counting-based approaches discard reads that map to multiple genes (e.g. paralogs), while salmon will model this and attempt to allocate the fragments correctly. However, the fact that a single input read may map to one genomic position but many transcriptomic positions is not at all a problem, as salmon was designed to solve just this issue (assuming the input SAM/BAM is in the correct format as described above).
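To make the "no double-counting" point concrete, here is a toy expectation-maximization sketch in Python. It is only an illustration under simplifying assumptions (equal effective lengths, no alignment scores or bias terms), not salmon's actual model, which is considerably richer. The names `em_allocate`, `tx2gene`, and the transcript IDs are invented for the example. Each read contributes exactly one unit of mass, split across the transcripts it aligns to, so gene-level sums cannot exceed the number of reads.

```python
def em_allocate(read_alignments, eff_lengths, n_iter=200):
    """Toy EM: split each read fractionally among the transcripts it hits.

    read_alignments: one collection of transcript IDs per read.
    eff_lengths: transcript ID -> effective length (assumed values).
    Returns transcript ID -> expected read count.
    """
    transcripts = list(eff_lengths)
    # Start from a uniform allocation of all reads.
    counts = {t: len(read_alignments) / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        new_counts = {t: 0.0 for t in transcripts}
        for alns in read_alignments:
            # Weight each candidate by current abundance per unit length.
            weights = {t: counts[t] / eff_lengths[t] for t in set(alns)}
            total = sum(weights.values())
            for t, w in weights.items():
                new_counts[t] += w / total   # fractions sum to 1 per read
        counts = new_counts
    return counts


# 6 reads unique to txA1, 2 unique to txA2, 4 in an exon shared by both,
# and 3 unique to txB1: 15 reads in total.
reads = [["txA1"]] * 6 + [["txA2"]] * 2 + [["txA1", "txA2"]] * 4 + [["txB1"]] * 3
eff_lengths = {"txA1": 1000.0, "txA2": 1000.0, "txB1": 1000.0}
tx2gene = {"txA1": "geneA", "txA2": "geneA", "txB1": "geneB"}

counts = em_allocate(reads, eff_lengths)

# Gene-level summarization: sum the transcript-level estimates per gene.
gene_counts = {}
for tx, c in counts.items():
    gene_counts[tx2gene[tx]] = gene_counts.get(tx2gene[tx], 0.0) + c

print({t: round(c, 3) for t, c in counts.items()})
print({g: round(c, 3) for g, c in gene_counts.items()})
```

In this toy case the 4 shared reads are divided 3:1 between txA1 and txA2 (following the ratio implied by the unique reads), giving roughly 9 and 3 reads, and geneA ends up with 12 reads in total: the 6 + 2 + 4 reads that actually came from it, not 16.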

Best,

Rob
