Pros and cons of increasing the default kmer size in Salmon

James Lloyd

unread,

Mar 16, 2016, 12:21:52 PM3/16/16

to Sailfish Users Group

Hello,

I was wondering what where the pros and cons of changing the default kmer size (31) in Salmon. I was thinking of increasing the kmer size. I am using 75nt PE reads for this experiment. It is basically a time-course and I got an odd result with both kallisto and Salmon. For one gene, I have 3 transcripts. Isoform A starts low and goes up, Isoform B, starts high and goes down. Isoform C starts high, goes down then up. Based on the biology I would expect isoform C to act like B and go down and not come up again so I want to check if this is real. I am mapping the reads and will look at the junction reads. Isoform C is very much like B but less of a difference from A than B does (same exon included but AD site making it smaller).

So my thinking was that the tools were perhaps having a difficult time correctly assigning reads to the right transcript as so few reads would be informative (hence why I want to look at junction reads). As both kallisto and Salmon do not have information about where the junctions are, I wondered if increasing kmer size would help more reads correctly assign to the right transcript isoform and might remove this problem. If you think this is worth trying, to what length would you advise changing the kmer size to? 40? 50?

All the best,

James

Rob

unread,

Mar 16, 2016, 12:55:08 PM3/16/16

to Sailfish Users Group

Hi James,

Well, if you're using the quasi-index, then 31 is actually the maximum k-mer size that is possible (this is because k must be odd, and we represent hashed k-mers in a 64-bit machine word). However, I doubt that changing the k-mer size would actually have much of an effect. This is because Salmon (and, actually, Sailfish now as well) doesn't simply count k-mers from reads. Rather, it makes use of quasi-mapping, which attempts to map the entire read to the transcriptome. The k-mers, in fact, just act as quick anchors to help whittle down possible mapping loci in the transcriptome. Thus, if there is any location in the 75bp read that overlaps a splice junction, then Salmon should map the read to the corresponding transcript(s) containing that junction. Further, if you're dealing with paired-end data, it will look at alignments for the entire fragment (i.e. read pair), so that the fragments should map to the subset of the transcriptome where both ends of the paired-end read map.

If you think that the unexpected results could be an artifact of the mapping procedure, one option would be to align your reads to the transcriptome (using e.g. HISAT2 or STAR) and then running alignment-based salmon on the resulting BAM file. The concordance between the quantification results might give you some notion of how much (if any) of the discrepancy between what you would expect and what you see could be due to the mapping vs. the inherent difficulty of the optimization problem. You might also try generating some bootstrap samples (`--numBootstraps`) to assess the variance in the abundance estimates generated for isoforms B and C (if the inference procedure is having substantial difficulty resolving multimapping between these isoforms, you'd expect to see substantial variance in the bootstrap samples).

I'd be very interested to hear what you find. While there typically can be small differences between mapping and alignment-based modes, we find that the quantification estimates are usually highly concordant.

Best,

Rob

James Lloyd

unread,

Mar 16, 2016, 1:12:33 PM3/16/16

to Sailfish Users Group

Hi Rob,

Thanks for the speedy reply! This is good to know. I have STAR mapped reads for these so I can try the alignment-based salmon and see if there is a difference, but this insight into how Salmon works is useful. I have run with bootstraps so that is something to check (I have been having problems with sleuth_live, which is how I normally check the variance).

Thanks again,

James

James Lloyd

unread,

Mar 16, 2016, 1:48:42 PM3/16/16

to Sailfish Users Group

I just realised that all my BAM files are to the genome so it will take me some time to work out which workaround to use. The Genomic vs. Transcriptomic alignments in the Documentation is a very useful place to start!

Rob

unread,

Mar 16, 2016, 2:13:22 PM3/16/16

to Sailfish Users Group

Ahh yes; that's a common issue when moving over from traditional "counting" pipelines. In HISAT2, you can build an index and align to the txome with the `--no-spliced-alignment` option. With STAR, there is an option (I forget the specific name) to "project" genomic alignments onto transcriptomic coordinates, which I've noticed to work well with Salmon in the past.

--Rob

James Lloyd

unread,

Mar 21, 2016, 12:34:46 AM3/21/16

to Sailfish Users Group

Hi Rob,

So I got an unexpected result from this. I used STAR mapped reads and the TPMs from this do not correlate as well with my lightweight alignment Salmon run as I expected. The spearman r=0.81. This is in contrast to the spearman r=0.97 when I compare the TPMs from Salmon and Kallisto. I could send figures of the scatter plots if you would like to see. Is this what you would expect or do you think something strange occurred? But from this data, I see I get the same pattern of expression for all my genes as I did before so that increases the chances there is real biology there. Below are some details about how I got the above results.

The transcriptome I am using comes from a custom (Cuffmerge) GTF I have. I made a transcriptome.fasta using this, genome.fasta and a Cufflinks tool (gffread). I used this to make the kallisto and salmon indexes.

I re-mapped my reads using STAR with the option --quantMode TranscriptomeSAM (not using --quantTranscriptomeBan Singleend). Here STAR maps to the genome but projects that onto the transcriptome. When I ran Salmon, I used the transcriptome.fasta from above that I had made the index from.

All the best,

James

Rob

unread,

Mar 21, 2016, 2:33:13 PM3/21/16

to Sailfish Users Group

Hi James,

That's very interesting. I would, indeed, be curious to see the correlation (scatter) plots (feel free to e-mail these privately if you wish, or we can continue the analysis here if you prefer). One more interesting analysis for which I'd be very interested to see the results

is the following: align the reads directly to the transcriptome using STAR / Bowtie2 / etc., and follow this with alignment-based Salmon. Here, the goal would be to see what fraction of variability is due to alignment vs quasimapping (for Salmon) or pseudoalignment (for Kallisto),

vs. the fact that the alignment with STAR was originally done with respect to the genome and then "projected" onto the transcriptome. I'm not sure of all of the details of how STAR handles e.g. reads that stretch into introns and the like when projecting genomic alignments

onto the transcriptome. The manner in which you have prepared the transcriptome itself for Salmon and kallisto sounds perfectly cromulent, and I imagine you could use this same transcriptome to do the test I suggest above with STAR / Bowtie2 etc. Also, I'm glad to hear that the

specific gene you mentioned in the first post exhibits the same behavior under these different pipelines. That does seem to provide evidence that what you're seeing (but didn't really expect) may be the result of real Biology rather than an artifact of inference.