ambiguously mapped reads counted once or multiple times

Chen Lingyun

Feb 9, 2019, 11:54:00 PM

to Sailfish Users Group

Hello everyone,

I like Salmon. I have some questions about counting the reads.

When mapping reads to transcripts, uniquely mapped reads should be counted once, right? For ambiguously mapped reads, are they counted once or multiple times? In other words, if a read maps ambiguously to both transcript A and transcript B, is it counted toward only one of them (either A or B), or is it counted twice, once for A and once for B?

Is it possible to use only the uniquely mapped reads for estimating the TPM with Salmon?

Thank you so much.

Best regards,

Lingyun

Rob

Feb 10, 2019, 9:59:33 AM

to Sailfish Users Group

Hi Lingyun,

Thanks for using salmon. To answer your question directly: reads that map ambiguously are counted only once, but they are allocated probabilistically among the targets to which they map. So it is actually neither of the cases you mention above. If a read maps to both A and B, it is counted toward both A and B, but probabilistically, so that the total sum of read counts is preserved. The probabilities with which the read is allocated to A and B are determined by salmon's variational Bayesian optimization procedure, and they depend on the structure of the mappings of all other reads and on other learned parameters of the experiment (e.g., the fragment length distribution). That is, if salmon is able to map 10 million reads (regardless of the number of alignments per read), then the total allocated read count (the sum of the NumReads column) will be 10 million. In fact, accurately allocating multi-mapping reads is one of the main purposes of salmon.
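To make the "counted once, but split probabilistically" idea concrete, here is a minimal sketch in Python. It is not salmon's actual procedure: it uses a plain EM loop rather than the variational Bayesian optimization, and it ignores the fragment length distribution and other learned parameters. The function name `allocate_reads`, the toy reads, and the effective lengths are all made up for illustration.

```python
def allocate_reads(compat, eff_len, n_iter=200):
    """Toy EM allocation (illustrative only, not salmon's implementation).

    compat  : list of tuples; each tuple names the transcripts one read maps to
    eff_len : dict mapping transcript name -> effective length
    Returns a dict of fractional read counts per transcript.
    """
    names = sorted(eff_len)
    # Start from a uniform guess of each transcript's share of the reads.
    theta = {t: 1.0 / len(names) for t in names}
    counts = {t: 0.0 for t in names}
    for _ in range(n_iter):
        counts = {t: 0.0 for t in names}
        # E-step: split each read among its targets in proportion to
        # (current share) / (effective length); the parts sum to 1 per read.
        for targets in compat:
            weights = {t: theta[t] / eff_len[t] for t in targets}
            total = sum(weights.values())
            for t, w in weights.items():
                counts[t] += w / total
        # M-step: update the shares from the fractional counts.
        n = sum(counts.values())
        theta = {t: counts[t] / n for t in names}
    return counts


# Hypothetical data: 60 reads unique to A, 20 unique to B, 20 mapping to both.
reads = [("A",)] * 60 + [("B",)] * 20 + [("A", "B")] * 20
eff_len = {"A": 1000.0, "B": 1000.0}

counts = allocate_reads(reads, eff_len)
print(counts)                 # roughly A = 75, B = 25
print(sum(counts.values()))   # 100 (up to floating-point error)
```

In this toy case the 20 ambiguous reads end up split about 15 to A and 5 to B, driven by the uniquely mapping reads, and the total stays at 100 mapped reads, which is the analogue of the NumReads column summing to the number of mapped reads.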

Best,

Rob

Chen Lingyun

Feb 10, 2019, 9:23:13 PM

to Sailfish Users Group

Dear Rob,

Thanks for your reply.

Best,

Lingyun

Nico Palaskas

Mar 13, 2019, 3:28:12 PM

to Sailfish Users Group

In response to Rob's comments, I have a follow-up question:

I used Salmon for quantification, tximport with lengthScaledTPM, and limma for differential expression. The RNA-seq data come from a single cell line, with RNA isolated from confluent and thinly plated cells (the contrast of interest), in triplicate.

One of my differentially expressed hits is HIST2H4B, histone cluster 2 H4B. HIST2H4A has an identical sequence; it does not reach statistical significance after multiple-testing correction, but it shows the opposite trend. Here, all mappings would be ipso facto ambiguous, right?

Based on what you said above, I am guessing that my result reflects a difference in the overall data structure between the two groups, and that the algorithm assigned more reads to HIST2H4B based on some "learned parameters". The choice of H4B over H4A would have to be random, though, right?

By extension, could a similar scenario arise for highly similar but non-identical sequences, where the "learned parameters" would systematically trump the sequence identity of the reads?