Dear Alex,
Thank you for being so active and attentive in maintaining this help forum. It really makes STAR worth using.
I find myself in one of the "corners" of alignment issues: all of my reads are very short, ranging from 20 to 30 bases in length.
Sometimes I align against large genomes (e.g. human), and sometimes against much smaller references, e.g. specific genes, operons, or gene families, including families of short genes (length < 100).
Note, all my reads are single-end reads.
Many of my reads get dumped as "unmapped - too short". Sometimes nearly all of them are dumped as such.
I have been trying to compile a list of STAR parameters that I should change, and I would like to hear your comments on them, as well as any suggestions about ones I've left out:
Genome Index step:
- --genomeSAindexNbases set to min(14, log2(ReferenceLength)/2 - 1), as you suggested here.
(Reference length = sum of all chromosome lengths, right? Or is it, say, the length of the shortest chromosome? This concerns me because sometimes my reference "chromosomes" are, e.g., genes of length 75.)
- --sjdbOverhang how do I best set this if my reads are very short and vary in length (20-30)?
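To make the --genomeSAindexNbases formula above concrete, here is a minimal Python sketch of how I currently compute it. It assumes that "reference length" means the sum of all chromosome lengths (which is exactly what I am asking about). The rounding to the nearest integer and the lower clamp at 1 are my own assumptions, and the function name is hypothetical.

```python
import math


def genome_sa_index_nbases(chromosome_lengths):
    """Return min(14, log2(total_length)/2 - 1) as an integer.

    Assumptions (mine, not confirmed): the total is the SUM of all
    chromosome lengths, the result is rounded to the nearest integer,
    and it is clamped to at least 1 for tiny references.
    """
    total_length = sum(chromosome_lengths)
    if total_length <= 0:
        raise ValueError("reference must contain at least one base")
    scaled = math.log2(total_length) / 2 - 1
    return max(1, min(14, round(scaled)))


if __name__ == "__main__":
    # A human-sized genome, a 1 Mb reference, and ten 75-base "chromosomes".
    for lengths in ([3_000_000_000], [1_000_000], [75] * 10):
        print(sum(lengths), genome_sa_index_nbases(lengths))
```

If the shortest chromosome were the relevant length instead, only the `total_length` line would change, so the question matters mostly for my references made of many very short genes.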
Alignment step:
- --seedSearchStartLmaxOverLread set to "0.50" - since my reads are so short (e.g. 25), I want the seed to come from the middle of the read, rather than from position "50" as given by the default value of the parameter "seedSearchStartLmax"
- --seedSearchStartLmax should I set this to, say, 10, since my reads are ~25 bp long? The default is 50, which is longer than all of my reads.
- should I set "seedSearchStartLmax" or "seedSearchStartLmaxOverLread"? Or should I set both?
- --outFilterMultimapNmax increased to a very large number (hundreds, even a thousand) since very short reads are prone to having many mapping locations when the reference is large.
- --seedPerWindowNmax should I increase this if I have a short reference and expect many reads to multimap on it?
- --winBinNbits should I decrease this when aligning reads against a very small reference (< 1 kbp) where I expect the reads to multimap many times along that short reference?
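To show how the alignment settings above would fit together, here is a minimal Python sketch that only assembles the STAR command line; it does not run anything. The values are the candidates I listed above, not settings I know to be correct. The paths and the function name are hypothetical placeholders, and --seedPerWindowNmax and --winBinNbits are left out because I do not yet have values for them.

```python
import shlex


def build_star_align_args(genome_dir, reads_fastq, out_prefix,
                          multimap_nmax=1000,
                          seed_start_lmax=10,
                          seed_start_lmax_over_lread=0.5):
    """Assemble (but do not run) a single-end STAR alignment command.

    The default values are the candidates from my list, not recommendations.
    """
    return [
        "STAR",
        "--runMode", "alignReads",
        "--genomeDir", genome_dir,
        "--readFilesIn", reads_fastq,          # single-end: one FASTQ file
        "--outFileNamePrefix", out_prefix,
        "--outFilterMultimapNmax", str(multimap_nmax),
        "--seedSearchStartLmax", str(seed_start_lmax),
        "--seedSearchStartLmaxOverLread", str(seed_start_lmax_over_lread),
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    args = build_star_align_args("star_index/", "reads.fastq", "out/sample_")
    print(shlex.join(args))
```

Keeping the candidate values as keyword arguments makes it easy to try one change at a time and compare how many reads end up as "unmapped - too short".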
Are there any other key parameters I am missing that would get more of my very short reads to map?
Thank you,
Matthew