Salmon tool giving an error about the library type


vk

May 25, 2018, 7:04:16 AM
to Sailfish Users Group
Hi,

I have RNA-seq samples (human) generated with a ribo-depletion kit. I first checked the library type of the samples using RSeQC, and it is reverse-stranded. So, for the alignment with `HISAT2` I used `--rna-strandness RF`, which corresponds to `fr-firststrand` in `TopHat`.

I'm trying to use `salmon` on the same samples with library type `-l ISR`, based on the library type section of the salmon manual.

This is the command I used:

    salmon quant -i index/ -l ISR -1 AT.1.fastq.gz -2 AT.2.fastq.gz -o transcripts_quant

When I checked the output log with the mapping information, I see the following at the end of the file:

    [2018-05-23 23:02:18.809] [jointLog] [info] Computed 333657 rich equivalence classes for further processing
    [2018-05-23 23:02:18.809] [jointLog] [info] Counted 27089612 total reads in the equivalence classes
    [2018-05-23 23:02:18.823] [jointLog] [warning] 0.0175308% of fragments were shorter than the k used to build the index (31).
    If this fraction is too large, consider re-building the index with a smaller k.
    The minimum read size found was 20.


    [2018-05-23 23:02:18.823] [jointLog] [info] Mapping rate = 28.9152%

    [2018-05-23 23:02:18.823] [jointLog] [info] finished quantifyLibrary()
    [2018-05-23 23:02:18.825] [jointLog] [info] Starting optimizer
    [2018-05-23 23:02:24.405] [jointLog] [info] Marked 0 weighted equivalence classes as degenerate
    [2018-05-23 23:02:24.423] [jointLog] [info] iteration = 0 | max rel diff. = 48.1542
    [2018-05-23 23:02:25.913] [jointLog] [info] iteration = 100 | max rel diff. = 0.0934775
    [2018-05-23 23:02:27.400] [jointLog] [info] iteration = 200 | max rel diff. = 0.0553936
    [2018-05-23 23:02:28.846] [jointLog] [info] iteration = 300 | max rel diff. = 0.0348972
    [2018-05-23 23:02:30.357] [jointLog] [info] iteration = 400 | max rel diff. = 0.0276639
    [2018-05-23 23:02:31.834] [jointLog] [info] iteration = 500 | max rel diff. = 0.0228071
    [2018-05-23 23:02:33.341] [jointLog] [info] iteration = 600 | max rel diff. = 0.0191266
    [2018-05-23 23:02:34.779] [jointLog] [info] iteration = 700 | max rel diff. = 0.0171199
    [2018-05-23 23:02:36.308] [jointLog] [info] iteration = 800 | max rel diff. = 0.0134323
    [2018-05-23 23:02:37.754] [jointLog] [info] iteration = 900 | max rel diff. = 0.0129089
    [2018-05-23 23:02:39.248] [jointLog] [info] iteration = 1000 | max rel diff. = 0.0108738
    [2018-05-23 23:02:40.756] [jointLog] [info] iteration = 1100 | max rel diff. = 0.010454
    [2018-05-23 23:02:41.058] [jointLog] [info] iteration = 1122 | max rel diff. = 0.00969727
    [2018-05-23 23:02:41.080] [jointLog] [info] Finished optimizer
    [2018-05-23 23:02:41.080] [jointLog] [info] writing output

    [2018-05-23 23:02:41.518] [jointLog] [warning] NOTE: Read Lib [( AT.1.fastq.gz, AT.2.fastq.gz )] :

    Greater than 5% of the fragments disagreed with the provided library type; check the file: transcripts_quant/lib_format_counts.json for details

As you can see, at the end it says `Greater than 5% of the fragments disagreed with the provided library type`, so I also looked into the `lib_format_counts.json` file.

This is what I saw in the `.json` file:

    {
        "read_files": "( AT.1.fastq.gz, AT.2.fastq.gz )",
        "expected_format": "ISR",
        "compatible_fragment_ratio": 0.8183487087227385,
        "num_compatible_fragments": 22168749,
        "num_assigned_fragments": 27089612,
        "num_consistent_mappings": 83629771,
        "num_inconsistent_mappings": 11396640,
        "MSF": 0,
        "OSF": 27232,
        "ISF": 4075759,
        "MSR": 0,
        "OSR": 73061,
        "ISR": 83629771,
        "SF": 2794681,
        "SR": 4423463,
        "MU": 0,
        "OU": 0,
        "IU": 0,
        "U": 0
    }


1) What is the problem here with the library type?

2) The overall alignment rate for this sample with HISAT2 is 91%, but here I see a mapping rate of 28%. Why the difference?

Rob

May 25, 2018, 11:23:08 AM
to Sailfish Users Group
Hi vk,

  Let me try and answer these in order.

1) The "problem" with the library type is that the specificity is quite low.  Generally, with a stranded protocol, one expects ~92-100% of the fragments to be drawn according to the specified library type (the protocol is rarely 100% effective).  However, in your case, the specificity seems to be even lower at ~82%.  This isn't necessarily a fatal problem, but could be indicative of something awry with the library.  It's also worth noting that this statistic is computed over all mappings rather than all fragments, so if there are some fragments that have a lot of mappings contrary to the library type, that could skew this statistic.  However, one would expect that effect to be small.

2)  There are a couple of reasons you might see this huge difference in mapping rate.  The most likely source is that hisat2 is mapping to the genome while salmon is mapping to the transcriptome.  Thus, if you have reads coming from intronic regions or unannotated isoforms, you would see them show up when mapping to the genome but not the transcriptome.  The second (but much less likely) reason could be mapping stringency.  How long are the reads?  Salmon's default k-mer size (which is the minimum allowable match size) is 31.  We recommend that this value be less than half the read length.  Here, I strongly suspect the former reason rather than the latter.  Even if hisat2 maps 91% of the reads, what fraction actually ends up in annotated transcripts?
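
One quick way to check that last question is to run featureCounts on the HISAT2 BAM and look at the `.summary` file it writes next to the counts. A sketch (the BAM and GTF names are placeholders, and `-s 2` assumes your reverse-stranded / RF library):

    # Count HISAT2-aligned fragments that fall within annotated exons
    # (-p: paired-end fragments, -s 2: reverse-stranded, matching RF)
    featureCounts -p -s 2 -a annotation.gtf -o counts.txt sample.hisat2.bam

    # counts.txt.summary lists Assigned vs. the Unassigned_* categories;
    # Assigned / total is the fraction landing in annotated transcripts
    cat counts.txt.summary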

Best,
Rob

vk

May 27, 2018, 5:55:31 AM
to Sailfish Users Group
Hi Rob,

Thanks a lot for the reply. For the sample I used, featureCounts reports 15173292 assigned read counts.

vk

May 27, 2018, 5:56:34 AM
to Sailfish Users Group
Also, could you please tell me whether salmon can be used for mapping reads to the genome? If not, why?



Rob

May 27, 2018, 11:08:08 AM
to Sailfish Users Group
Hi vk,

  So, it looks like salmon actually assigns more reads (27089612) than featureCounts (15173292) here --- though both assign a low fraction of the overall library.  This suggests that you have a lot of reads mapping to genomic loci that are not part of the processed (i.e., fully-spliced) transcriptome.  Depending on your exact experimental protocol, this seems like a sample quality issue to me.
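
As a rough back-of-the-envelope check (this assumes the mapping-rate denominator in the salmon log is the total number of sequenced fragments, and that featureCounts was counting fragments as well), using jq as a calculator:

    # Implied library size from the salmon log: assigned / mapping rate
    jq -n '27089612 / 0.289152'                 # ≈ 93.7M fragments sequenced
    # Fraction of that library assigned by each tool
    jq -n '15173292 / (27089612 / 0.289152)'    # featureCounts ≈ 0.16, vs. ≈ 0.29 for salmon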

Regarding your second question: salmon does not currently map against the genome.  The reason for this is that it treats each entry in the indexed fasta file as a separate quantification target.  If you indexed the genome, it would give quantification estimates of the different chromosomes and contigs (rather than the genes and transcripts located on them), which is not very useful for your purposes (but which might be useful for people doing e.g. metagenomics / microbiomics).  Further, the mapping algorithm is really geared toward the transcriptome, though we are working on a new indexing scheme that can handle mapping to the genome (but reads would still need to fall within annotated genes / transcripts to be quantified properly).

If you really wish to map against the genome directly, you can accomplish this in salmon's alignment-based mode.  Basically, you could use STAR to align to the genome.  STAR has a mode that will "project" genomic mappings onto transcriptomic coordinates.  You can feed the resulting BAM file to salmon's alignment-based mode to quantify using these alignments.  However, given the featureCounts results you shared above, I would still imagine that most of the reads here are not mapping to annotated features.  So it's probably best now to dig into exactly where those reads are going.  Are they coming from valid features that just aren't in your annotation (e.g. noncoding RNAs), or are they coming from introns / intergenic sequence?
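
For completeness, a rough sketch of that STAR → salmon route (the index path, transcriptome FASTA, and thread count are placeholders; see the STAR and salmon documentation for the details):

    # 1) Align to the genome and project alignments onto transcript coordinates
    #    (requires a STAR index built with the matching annotation GTF)
    STAR --runThreadN 8 \
         --genomeDir star_index/ \
         --readFilesIn AT.1.fastq.gz AT.2.fastq.gz \
         --readFilesCommand zcat \
         --quantMode TranscriptomeSAM \
         --outSAMtype BAM Unsorted

    # 2) Quantify from the transcriptome-coordinate BAM in salmon's
    #    alignment-based mode (-t is the transcriptome FASTA, -a the BAM)
    salmon quant -t transcripts.fa -l ISR \
          -a Aligned.toTranscriptome.out.bam \
          -o transcripts_quant_alnmode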

Best,
Rob 

vk

Oct 17, 2018, 5:40:33 AM
to Sailfish Users Group
Hi Rob,

Sorry for asking you the same question again after a few months.

I used salmon for TCGA data.

This is the command I used:

    salmon quant -i Homo_sapiens.GRCh38.92_quasi_index/ -l ISR -1 TCGA-BH-A1F6-01A.1_1.fastq -2 TCGA-BH-A1F6-01A.1_2.fastq -o transcripts_quant

In the lib_format_counts.json file I see the following:

    {
        "read_files": "( ../TCGA-BH-A1F6-01A.1_1.fastq, ../TCGA-BH-A1F6-01A.1_2.fastq )",
        "expected_format": "ISR",
        "compatible_fragment_ratio": 0.4684520123657979,
        "num_compatible_fragments": 24034413,
        "num_assigned_fragments": 51306030,
        "num_consistent_mappings": 106918336,
        "num_inconsistent_mappings": 125574795,
        "MSF": 0,
        "OSF": 85442,
        "ISF": 106435389,
        "MSR": 0,
        "OSR": 83861,
        "ISR": 106918336,
        "SF": 8608401,
        "SR": 8609949,
        "MU": 0,
        "OU": 0,
        "IU": 0,
        "U": 0
    }

Last time you told me this:

The "problem" with the library type is that the specificity is quite low.  Generally, with a stranded protocol, one expects ~92-100% of the fragments to be drawn according to the specified library type (the protocol is rarely 100% effective)

I'm a bit confused. With the above result from the .json file, what can I say now? How do you assess the specificity? Thank you.

Rob

Oct 17, 2018, 9:21:13 AM
to Sailfish Users Group
Hi vk,

  This looks suspiciously like an unstranded sample.  The compatible fragment ratio means ~47% of the reads mapped as ISR.  If you look at all of the mapping types, an almost equal number map as ISF (the rest of the mapped reads map as orphans, and a tiny fraction map as OSF/OSR --- which are likely dovetailed reads).  This means that there is almost no strand bias in the mapping, which suggests that this sample was sequenced using an unstranded protocol.
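
You can read that directly off the JSON; a small sketch (again assuming `jq`, and whatever path your lib_format_counts.json lives at):

    # Strand split among consistent paired mappings; ~0.5 / ~0.5 means
    # essentially no strand bias, i.e. the library looks unstranded
    jq '{frac_ISR: (.ISR / (.ISR + .ISF)),
         frac_ISF: (.ISF / (.ISR + .ISF))}' \
       transcripts_quant/lib_format_counts.json

For your numbers that gives 106918336 / (106918336 + 106435389) ≈ 0.50 on each strand. If the sample really is unstranded, re-running salmon with `-l IU` (or `-l A`, which lets salmon infer the library type) would likely be the more appropriate choice.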

--Rob