salmon questions


Runxuan zhang

unread,
Dec 11, 2014, 10:13:00 AM12/11/14
to sailfis...@googlegroups.com
Dear Rob,

I have been trying salmon in read mode, and I have some questions about the information it reports.

Here are the commands I used (with and without the mapping cache):

/mnt/apps/salmon/0.2.2/bin/salmon quant --disableMappingCache -i /home/rz41242/Projects/2014AS/raw/salmon_Index -l IU -1 <(pigz -cd 805_LIB5949_LDI4765_ACAGTG_L002_R1.fastq.gz) -2 <(pigz -cd 805_LIB5949_LDI4765_ACAGTG_L002_R1.fastq.gz) -o salmon_test

/mnt/apps/salmon/0.2.2/bin/salmon quant -i /home/rz41242/Projects/2014AS/raw/salmon_Index -l IU -1 <(pigz -cd 805_LIB5949_LDI4765_ACAGTG_L002_R1.fastq.gz) -2 <(pigz -cd 805_LIB5949_LDI4765_ACAGTG_L002_R1.fastq.gz) -o salmon_test1

Q1. There is some output from salmon regarding alignment; I wonder whether this information is relevant given that I am running in read mode, or whether it means the read information I provided is not correct.

Salmon output "Greater than 5% of the alignments (but not, necessarily reads) disagreed with the provided library type; check the file: salmon_test/libFormatCounts.txt for details". 

In /libFormatCounts.txt:
========
Read library consisting of files: ( /dev/fd/63, /dev/fd/62 )

Expected format: Library format { type:paired end, relative orientation:inward, strandedness:unstranded }

# of consistent alignments: 0
# of inconsistent alignments: 48801410
strand bias = -nan (0.5 is unbiased)
# alignments with format Library format { type:paired end, relative orientation:inward, strandedness:(antisense, sense) }: 0
# alignments with format Library format { type:paired end, relative orientation:inward, strandedness:(sense, antisense) }: 0

========
---- counts for each format type ---
Library format { type:single end, relative orientation:matching, strandedness:(sense, antisense) } : 0
Library format { type:paired end, relative orientation:matching, strandedness:(sense, antisense) } : 0
Library format { type:single end, relative orientation:outward, strandedness:(sense, antisense) } : 0
Library format { type:paired end, relative orientation:outward, strandedness:(sense, antisense) } : 0
Library format { type:single end, relative orientation:inward, strandedness:(sense, antisense) } : 0
Library format { type:paired end, relative orientation:inward, strandedness:(sense, antisense) } : 0
Library format { type:single end, relative orientation:none, strandedness:(sense, antisense) } : 0
Library format { type:paired end, relative orientation:none, strandedness:(sense, antisense) } : 0
Library format { type:single end, relative orientation:matching, strandedness:(antisense, sense) } : 0
Library format { type:paired end, relative orientation:matching, strandedness:(antisense, sense) } : 0
Library format { type:single end, relative orientation:outward, strandedness:(antisense, sense) } : 0
Library format { type:paired end, relative orientation:outward, strandedness:(antisense, sense) } : 0
Library format { type:single end, relative orientation:inward, strandedness:(antisense, sense) } : 0
Library format { type:paired end, relative orientation:inward, strandedness:(antisense, sense) } : 0
Library format { type:single end, relative orientation:none, strandedness:(antisense, sense) } : 0
Library format { type:paired end, relative orientation:none, strandedness:(antisense, sense) } : 0
Library format { type:single end, relative orientation:matching, strandedness:sense } : 0
Library format { type:paired end, relative orientation:matching, strandedness:sense } : 24382429
Library format { type:single end, relative orientation:matching, strandedness:sense } : 0
Library format { type:paired end, relative orientation:matching, strandedness:sense } : 0
Library format { type:single end, relative orientation:matching, strandedness:sense } : 0
Library format { type:paired end, relative orientation:matching, strandedness:sense } : 0
Library format { type:single end, relative orientation:matching, strandedness:sense } : 0
Library format { type:paired end, relative orientation:matching, strandedness:sense } : 0
Library format { type:single end, relative orientation:matching, strandedness:antisense } : 0
Library format { type:paired end, relative orientation:matching, strandedness:antisense } : 24418981
Library format { type:single end, relative orientation:matching, strandedness:antisense } : 0
Library format { type:paired end, relative orientation:matching, strandedness:antisense } : 0
Library format { type:single end, relative orientation:matching, strandedness:antisense } : 0
Library format { type:paired end, relative orientation:matching, strandedness:antisense } : 0
Library format { type:single end, relative orientation:matching, strandedness:antisense } : 0
Library format { type:paired end, relative orientation:matching, strandedness:antisense } : 0
Library format { type:single end, relative orientation:matching, strandedness:unstranded } : 0
Library format { type:paired end, relative orientation:matching, strandedness:unstranded } : 0
Library format { type:single end, relative orientation:outward, strandedness:unstranded } : 0
Library format { type:paired end, relative orientation:outward, strandedness:unstranded } : 0
Library format { type:single end, relative orientation:inward, strandedness:unstranded } : 0
Library format { type:paired end, relative orientation:inward, strandedness:unstranded } : 0
Library format { type:single end, relative orientation:none, strandedness:unstranded } : 0
Library format { type:paired end, relative orientation:none, strandedness:unstranded } : 0
------------------------------------

Q2: I found that the TPMs for the same library differ with and without the mapping cache. So which one is more accurate? Does the mapping cache improve speed at the expense of accuracy?

Q3: Does adaptor & quality trimming of the reads improve the accuracy/speed for quantification using salmon?

Q4: How can I obtain mapping ratios from salmon?

Q5: What do "processed fragments" and "hits per frag" mean? I found that these numbers also vary enormously with and without the mapping cache!

Thank you very much!

Best regards,

Runxuan



Rob

unread,
Dec 11, 2014, 12:35:32 PM12/11/14
to sailfis...@googlegroups.com


On Thursday, December 11, 2014 10:13:00 AM UTC-5, Runxuan zhang wrote:
Dear Rob,

I have been trying salmon in read mode, and I have some questions about the information it reports.

Here are the commands I used (with and without the mapping cache):

/mnt/apps/salmon/0.2.2/bin/salmon quant --disableMappingCache -i /home/rz41242/Projects/2014AS/raw/salmon_Index -l IU -1 <(pigz -cd 805_LIB5949_LDI4765_ACAGTG_L002_R1.fastq.gz) -2 <(pigz -cd 805_LIB5949_LDI4765_ACAGTG_L002_R1.fastq.gz) -o salmon_test

/mnt/apps/salmon/0.2.2/bin/salmon quant -i /home/rz41242/Projects/2014AS/raw/salmon_Index -l IU -1 <(pigz -cd 805_LIB5949_LDI4765_ACAGTG_L002_R1.fastq.gz) -2 <(pigz -cd 805_LIB5949_LDI4765_ACAGTG_L002_R1.fastq.gz) -o salmon_test1


Hi Runxuan,

  There are a few different things going on here.  Below are my best explanations, but please let me know if anything remains unclear.
 It appears that you're providing the read1 file twice instead of the read1 file and the read2 file:
 
-1 <(pigz -cd 805_LIB5949_LDI4765_ACAGTG_L002_R1.fastq.gz) -2 <(pigz -cd 805_LIB5949_LDI4765_ACAGTG_L002_R1.fastq.gz)

 The message you're getting is because salmon detects that the expected mapping orientation (reads on opposite strands facing inward, as designated by "IU") is 
not being met by a significant number of the reads.  In fact, since you provided the read1 file twice, a read is always "paired" with itself, and so both reads in a 
"pair" will map to the same strand, with an approximately equal number of "pairs" mapping to the sense and antisense strands.  I suspect simply providing 
the read2 file in its place will dispense with this warning (and give you more accurate results).
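A sketch of the fix follows; the `_R1`/`_R2` naming convention is an assumption based on the filename in the original command, so adjust to however your mate files are actually named:

```shell
# Derive the mate (_R2) filename from the _R1 filename and sanity-check it;
# this assumes the usual _R1/_R2 naming convention.
r1="805_LIB5949_LDI4765_ACAGTG_L002_R1.fastq.gz"
r2="${r1/_R1/_R2}"   # substitute _R1 -> _R2
echo "$r2"

# The corrected invocation would then pass the mate file to -2, e.g.:
# salmon quant -i salmon_Index -l IU \
#   -1 <(pigz -cd "$r1") -2 <(pigz -cd "$r2") -o salmon_test
```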

 
Q2: I found that the TPMs for the same library differ with and without the mapping cache. So which one is more accurate? Does the mapping cache improve speed at the expense of accuracy?

   No, the mapping cache does not improve speed at the expense of accuracy.  In fact, if you're reading from a regular file (i.e. not using input redirection), or if your file contained more reads,
then the difference in TPM between using and not using the mapping cache should be minimal or non-existent (since salmon, like all such isoform quantification methods, uses a statistical algorithm,
one naturally expects very small between-run fluctuations).
 
  In your case, what is most likely happening is the following.  Salmon, being an online algorithm, wants to observe a certain number of reads before it's happy declaring "convergence".  If it doesn't 
see this number of observations in its first pass through the file, it will make a second pass (and, if the dataset is small enough, a 3rd, 4th, etc.).  This is exactly the process the mapping cache was 
invented to speed up.  Finding the set of "potential" loci of a read is a deterministic rather than stochastic process, and so it doesn't change between passes over the file.  The mapping cache stores 
these results so that they don't need to be recomputed on subsequent passes.  However, if your reads are being given to salmon via input redirection / process substitution, they are coming from 
a fifo rather than a regular file.  Unlike a regular file, a fifo can only be read from once, and so subsequent passes cannot be performed.  Salmon checks this condition and reports if it wants to look 
through the reads again but they were provided via a fifo (in this case it prints a warning message and exits).  So, in this situation, what I presume happened is that the quantification with the
mapping cache passed over the mappings twice, providing potentially more accurate quantification, while the run without the mapping cache was not able to make a second pass over the file
and thus terminated abundance estimation after one pass.

  Like I mentioned above, small fluctuations between runs are normal and are due to the statistical nature of the algorithm (and of most other algorithms I know of for solving this problem).  However,
I suspect that if you were to decompress these files ahead of time and pass the decompressed files directly to salmon, then the runs with and without the mapping cache would provide more 
similar estimates (since both would be able to make the number of observations they want).
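The fifo-versus-regular-file distinction can be demonstrated with a small shell sketch (this is just an illustration of the stream semantics, not salmon itself):

```shell
# A regular file can be read twice; a fifo is drained by its first reader.
tmp=$(mktemp)
printf 'read1\nread2\n' > "$tmp"
pass1=$(cat "$tmp")
pass2=$(cat "$tmp")                 # second pass over a regular file works fine
[ "$pass1" = "$pass2" ] && echo "regular file: second pass sees the same data"

fifodir=$(mktemp -d)
mkfifo "$fifodir/stream"
printf 'read1\nread2\n' > "$fifodir/stream" &   # writer blocks until a reader opens
fpass1=$(cat "$fifodir/stream")     # first (and only) pass drains the stream
wait
echo "fifo: drained after one pass; a second cat would block or see nothing"
rm -rf "$tmp" "$fifodir"
```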
 

Q3: Does adaptor & quality trimming of the reads improve the accuracy/speed for quantification using salmon?

Adaptor trimming will likely help (and definitely will not hurt).  The answer for quality trimming is less clear.  This is actually an issue that the community is still 
debating in the context of "alignment-based" RNA-seq quantification (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0085024).  Some argue 
that quality trimming the reads results in a higher mapping rate but actually diminishes accuracy.  I expect that very gentle quality trimming would not hurt much 
(it may also slightly improve speed), but aggressive quality trimming is probably a bad idea.
 

Q4: How can I obtain mapping ratios from salmon?

  Actually, this is something that it makes sense for salmon to output at the end of its run.  Would you mind opening an issue (which I'll mark as a feature request) over 
on the GitHub repo (https://github.com/kingsfordgroup/sailfish/issues)?  This will remind me to implement this feature, which should be very easy, in the next 
commit.

  Currently, the easiest thing to do is probably just to sum up the output column with the estimated number of reads for each transcript.  This is an assignment of 
the read count to each transcript in a way that has been (probabilistically) disambiguated.  So, the sum of entries in this column should be very close to the total
number of mapped reads (considering a single mapping for each read).  If you divide this number by the total number of input reads, that should give you an 
estimate of the mapping ratio.
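As a toy illustration of that computation (the column layout below is an assumption; suppose the fourth column of the quantification output holds the estimated number of reads per transcript):

```shell
# Sum the estimated-reads column, then divide by the total number of input
# fragments to estimate the mapping ratio. The file contents are made up.
cat > quant_example.sf <<'EOF'
t1 1000 50.0 120.5
t2 2000 25.0 80.25
EOF
mapped=$(awk '{s += $4} END {print s}' quant_example.sf)
total=400   # total input fragments, counted separately from the FASTQ files
ratio=$(awk -v m="$mapped" -v t="$total" 'BEGIN {printf "%.3f", m/t}')
echo "estimated mapped = $mapped; mapping ratio = $ratio"
rm quant_example.sf
```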


Q5: What do "processed fragments" and "hits per frag" mean? I found that these numbers also vary enormously with and without the mapping cache!

  These are simply informative outputs to let one interactively monitor the progress of salmon.  "Processed fragments" is the total number of reads that have
been looked at by salmon.  The value is continuously updated as different threads map reads and estimate how they affect the abundances.  So as not to 
spam the output console (and since these stats are only meant to give an interactive notion of progress), the numbers are written to the same line of the console.
That is, as processing progresses, the previous "processed fragments" number is erased and the new one is written in its place.  The same is true of 
the "hits per frag" statistic, which is just the average number of potential loci that salmon finds per fragment.  For example, if you had two fragments and 
one of them potentially mapped to 10 places and the other mapped to 1, then "hits per frag" would be (10+1)/2 = 5.5.

  The differences in these numbers between runs could be due to a few things.   For the "processed fragments" number, with the mapping cache, I'd 
expect you to observe many more fragments (probably close to twice the number).  This is because, as explained above, the mapping cache lets salmon pass 
over the set of mappings again to derive more accurate estimates, while, without the mapping cache (since the reads are not coming from a regular file), only a 
single pass is possible.  For the "hits per frag" number (which I'd expect to be more similar between the two runs than the "processed fragments" number), the
difference is likely due to normal variation in how that output was updated and written to the screen (at a given point in time, this number is only an estimate, as reads
are constantly being consumed and mapped by multiple different threads).


Thank you very much!

I hope these responses addressed your questions.  Thanks for trying out salmon, and let us know if you have any other questions (and also if it seems to be working
well for you)!

Best,
Rob

 



Runxuan zhang

unread,
Dec 12, 2014, 7:18:19 AM12/12/14
to sailfis...@googlegroups.com
Dear Rob,

Thank you very much for your prompt, detailed, and clear explanations.  Indeed, salmon is working very well for me. For my current dataset, even using a small k-mer size (20), the mapped ratio is still below 50%; that probably means half of the reads have not been used. 

I am looking forward to seeing how salmon performs in this respect.

Thanks a lot,

Runxuan

Rob

unread,
Dec 12, 2014, 10:23:19 AM12/12/14
to sailfis...@googlegroups.com
Hi Runxuan,

  Thanks for opening the issue on github.  I hope to get to it either today or tomorrow.  While I still think salmon may give you better results (specifically given your use of paired-end reads), it's important to note that a k-mer mapping rate of ~50% in Sailfish doesn't mean that ~50% of the *reads* haven't been used.  Specifically, there is a critical question of where the non-matching k-mers occur and how much they overlap.  For example, we can construct a "worst case" scenario where most of a read is covered by k-mers that map to the reference, yet most of the k-mers themselves don't map --- note, this is a highly synthetic example and the case is usually far from this adversarial in practice. 

Imagine I have a k-mer size of 20 and a read of length 41.  There are 41 - 20 + 1 = 22 potential k-mers in this read. Imagine the first k-mer maps, but at the 21st base (that is, index 20 in the read, counting from offset 0), there is a sequencing error.  Now, the next 20 k-mers may not map to our reference.  Finally, the last k-mer, which begins at offset 21 and covers through offset 40, maps to our reference.  In this case ~98% of the bases in our read have been covered by mapping k-mers, but only ~9% of the k-mers have "mapped".  Despite this, we've still obtained useful information about this read and it will contribute to quantification.  Obviously, this is an adversarial and highly unlikely scenario, but it illustrates why we can't simply assume that a low k-mer mapping rate implies that reads are being used at the same (low) rate.  And, in fact, we've found that the k-mer mapping ratio is not necessarily a good indicator of how accurate the quantification estimates (e.g. TPM) are.  Again, this being said, I still expect that salmon may do better in your scenario, and I hope to have your feature request complete relatively soon.
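For the curious, the arithmetic in this worst-case scenario can be double-checked with a quick awk sketch (read length 41, k = 20, one sequencing error at 0-based offset 20):

```shell
# Count the k-mers that avoid the erroneous base, and the bases those k-mers
# cover, for a read of length 41 with k = 20 and an error at offset 20.
out=$(awk 'BEGIN {
  L = 41; k = 20; err = 20
  n = L - k + 1                      # number of k-mer start positions: 22
  mapped = 0
  for (s = 0; s < n; s++)
    if (err < s || err > s + k - 1)  # k-mer [s, s+k-1] avoids the error
      mapped++
  covered = L - 1                    # every base except the error is covered
  printf "kmers=%d mapped=%d base_coverage=%.1f%% kmer_rate=%.1f%%",
         n, mapped, 100*covered/L, 100*mapped/n
}')
echo "$out"
```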

Best,
Rob

Runxuan zhang

unread,
Dec 12, 2014, 11:21:27 AM12/12/14
to sailfis...@googlegroups.com
Dear Rob,

Thanks a lot for your email. I see that the mapping ratio is actually the ratio of k-mers derived from the reads that map to the reference versus all k-mers derived from the reads.

In that sense, 50% is quite good then. Given that my reads are 100 nt long (81 k-mers with k-mer size 20), one mismatch may cause up to 25% of the k-mers to fail to map. Complications such as AS events, post-transcriptional regulation, incompleteness of the database, etc., will have bigger impacts. 

Of course it is easier said than done, but I was wondering whether, besides the mapped ratio (of the k-mers) over all the reads, we could have more statistics on the coverage of each read by k-mers (ranging from 0-100%, for example), and on what percentage of the k-mers from each read point uniquely to a transcript. These data would help the interpretation of the results greatly when we try to look at individual isoforms. 

Sorry to create more work for you; these are just some thoughts for your consideration.

Thanks a lot,

Runxuan

Runxuan zhang

unread,
Dec 17, 2014, 6:54:07 AM12/17/14
to sailfis...@googlegroups.com
Dear Rob,
 
Regarding the calculation of the mapped ratio of the reads: for paired-end reads, does one pair count as one read or two reads when I calculate the total number of reads in the library?

Thanks a lot,

Runxuan 

Rob

unread,
Dec 17, 2014, 10:32:23 AM12/17/14
to sailfis...@googlegroups.com


On Wednesday, December 17, 2014 6:54:07 AM UTC-5, Runxuan zhang wrote:
Dear Rob,
 
Regarding the calculation of the mapped ratio of the reads: for paired-end reads, does one pair count as one read or two reads when I calculate the total number of reads in the library?


Hi Runxuan,
  
  Salmon always performs quantification in terms of "fragments".  Thus, if you're passing the reads to salmon as paired-end (i.e. using a paired-end library type), it will count each pair as a single "fragment".  So, to be consistent, when comparing against the total number of reads in the library, you should count the number of read pairs.
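To make the denominator concrete, here is a small sketch of counting fragments for a paired-end library (a FASTQ record is 4 lines, and since each pair is one fragment, you count records in a single mate file only; the file contents are made up):

```shell
# Count fragments in a paired-end library: records in ONE mate file,
# where each FASTQ record occupies exactly 4 lines.
cat > example_R1.fastq <<'EOF'
@frag1
ACGTACGT
+
IIIIIIII
@frag2
TTGGCCAA
+
IIIIIIII
EOF
frags=$(( $(wc -l < example_R1.fastq) / 4 ))
echo "fragments = $frags"
rm example_R1.fastq
```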

Best,
Rob