Reducing k-mer length in Salmon for reads < 75 nt


Thomas Sandmann

Oct 18, 2016, 4:58:42 PM
to Sailfish Users Group
Hi,

I would like to reprocess older Illumina data with 50 bp single-end reads. The Salmon documentation states:

We find that a k of 31 seems to work well for reads of 75bp or longer, but you might consider a smaller k if you plan to deal with shorter reads. 
 
  • How important is the k-mer size for shorter read length?
  • Do you have a recommendation for 50 nt long reads, or a rule of thumb to get started?
  • Perhaps most importantly - how can I assess whether the k-mer length is suitable, e.g. are there quality metrics that would go off the rails if my choice is a bad one?

Thanks a lot for any input,
Thomas

Rob

Oct 19, 2016, 11:23:46 AM
to Sailfish Users Group
Hi Thomas,

  When using salmon's default indexing scheme (i.e. "quasi"), the k-mer size has the effect that each mapping of a read must be anchored by an exact match of length at least k.  The reason we recommend reducing the k-mer size for short reads is that, though most 50bp reads would likely be OK with a k-mer size of 31, it is possible for a single error (e.g. at base 25) to cause the entire read not to map (since there would be no exact match of size 31 anywhere in the read).  Though I can't offer a deterministic rule to choose the optimal k for short reads, a good rule of thumb to start with would be to ensure that k < l / 2, where l is the read length. That is, for 50 bp reads, I would choose a k-mer size < 25 (perhaps 21 or 23).  This ensures that even a single adversarially placed error / mutation won't prevent mapping, because at least one side of the error still contains an exact match of length k (true read errors are slightly less problematic since they tend to cluster at the end of reads).
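
  The arithmetic behind that rule of thumb can be sketched in a few lines of Python. This is only a simplified model, not salmon's actual mapping code: it assumes a read can be anchored whenever it contains an error-free stretch of at least k bases, and the function names are made up for illustration.

```python
def longest_error_free_run(read_len, error_positions):
    """Length of the longest stretch of the read with no error (0-based positions)."""
    longest = 0
    prev = -1  # position of the previous error; -1 means "before the read starts"
    for pos in sorted(error_positions):
        longest = max(longest, pos - prev - 1)
        prev = pos
    # Stretch between the last error and the end of the read.
    return max(longest, read_len - prev - 1)


def can_anchor(read_len, k, error_positions):
    """Simplified model: an exact match of length >= k must survive somewhere."""
    return longest_error_free_run(read_len, error_positions) >= k


def survives_any_single_error(read_len, k):
    """True if an exact k-mer remains wherever a single error is placed."""
    return all(can_anchor(read_len, k, [p]) for p in range(read_len))


print(can_anchor(50, 31, [24]))           # False: the error splits the read into 24 + 25 bases
print(survives_any_single_error(50, 31))  # False
print(survives_any_single_error(50, 23))  # True
```

  With 50 bp reads, a single error near the middle leaves at most 25 clean bases on either side, which is why k = 31 can fail while k = 21 or 23 cannot (under this one-error model).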

  The effect you'll see if you use a k that is too large is that the reported mapping rate will drop off.  So, one thing to check for is the relative "stability" of the mapping rate around the k value you choose.  For example, I suspect you'd see very similar mapping rates at k=19,21,23 in your dataset, meaning that these are all probably acceptable k-mer size choices.  If you have any questions once you start processing some of this data, I'd be happy to answer them at that point.
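
  One way to make the "stability" check concrete, once you have quantified the same sample against indices built with several k values, is sketched below. The helper name, the 1-percentage-point tolerance, and the example rates are all assumptions chosen for illustration; the rates are placeholders, not measurements from any dataset.

```python
def stable_k_values(mapping_rates, tolerance=1.0):
    """Return the k values whose mapping rate (in percent) lies within
    `tolerance` percentage points of the best rate observed.

    `mapping_rates` maps k -> mapping rate reported for the run with that k.
    The default tolerance is an arbitrary illustrative choice.
    """
    best = max(mapping_rates.values())
    return sorted(k for k, rate in mapping_rates.items() if best - rate <= tolerance)


# Placeholder numbers only, to show the shape of the comparison.
example_rates = {19: 88.0, 21: 87.5, 23: 87.5, 31: 80.0}
print(stable_k_values(example_rates))  # [19, 21, 23]
```

  If neighbouring k values all land in the returned list, the choice among them probably matters little; a k whose rate falls well below the rest is likely too large for the read length.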

Best,
Rob

Thomas Sandmann

Oct 19, 2016, 11:57:59 AM
to Sailfish Users Group
Thanks a lot for your great explanation, Rob! That gives me a very useful starting point.
Thomas