Parameter changes for the very short read case


Matthew

Sep 25, 2014, 5:30:56 PM
Dear Alex,

Thank you for being so active and attentive in maintaining this help forum.  It really makes STAR worth using.


I find myself in one of the "corners" of alignment issues: all of my reads are very short, ranging from 20 to 30 bases in length.
Sometimes I align against large genomes (e.g. human), but sometimes against smaller references, e.g. specific genes/operons/families of genes, including families of short genes (length < 100).
Note, all my reads are single-end reads.

Many of my reads get dumped as "unmapped - too short".  Sometimes nearly all of them are dumped as such.

I have been trying to compile a list of STAR parameters that I should change, and would like to hear your comments or any suggestions about ones I've left out:

Genome Index step:

  • --genomeSAindexNbases     set to min(14, log2(ReferenceLength)/2 - 1), as you suggested here.
      (Reference length = sum of all chromosome lengths, right?  Or is it, say, the length of the shortest chromosome? This is a concern of mine because sometimes my reference "chromosomes" are, e.g., genes of length 75.)
  • --sjdbOverhang       how do I best set this if my reads vary in length (20-30), but are very short?

Alignment step:

  • --seedSearchStartLmaxOverLread      set to "0.50"  - since my reads are so short (e.g. 25), I want the seed to come from the middle of the read , rather than at position "50" as given by default by the parameter "seedSearchStartLmax"
  • --seedSearchStartLmax       should I set this to, say, 10,  since my reads are ~25 bp long?  The default is 50, which is longer than all of my reads.
  • Should I set "seedSearchStartLmax" or "seedSearchStartLmaxOverLread"? Should I set both?
  • --outFilterMultimapNmax     increased to a very large number (hundreds, even a thousand)  since very short reads are prone to having many mapping locations when the reference is large.
  • --seedPerWindowNmax       should I increase this if I have a short reference and expect many reads to multimap on it?
  • --winBinNbits      should I decrease this when  aligning reads against a very small reference (< 1 kbp) where I expect the reads to multimap many times along that short reference?

Are there any other key parameters that I am missing that would get more of my many very-short-reads to map?


Thank you,
Matthew

Alexander Dobin

Sep 30, 2014, 12:34:29 PM
to rna-...@googlegroups.com
Hi Matthew,

Please find my answers below, after each of your questions.

Cheers
Alex
  • --genomeSAindexNbases     set to min(14, log2(ReferenceLength)/2 - 1), as you suggested here.
      (Reference length = sum of all chromosome lengths, right?  Or is it, say, the length of the shortest chromosome? This is a concern of mine because sometimes my reference "chromosomes" are, e.g., genes of length 75.)
Reference length = sum of all chromosome lengths.
  • --sjdbOverhang       how do I best set this if my reads vary in length (20-30), but are very short?
The best value is sjdbOverhang = maxReadLength - 1. If you have rare long reads, you can use something like the 90th percentile of read length instead of the maximum.
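
The two index-time values above can be worked out with a few lines of Python. This is only a sketch of the arithmetic quoted in this thread: the function names are made up for illustration, rounding down to an integer and clamping at 1 are assumptions, and the percentile uses a simple nearest-rank rule.

```python
import math


def genome_sa_index_nbases(chrom_lengths):
    """min(14, log2(ReferenceLength)/2 - 1), where ReferenceLength is the
    sum of all chromosome (sequence) lengths in the reference."""
    total = sum(chrom_lengths)
    value = int(math.log2(total) / 2 - 1)  # assumption: round down
    return max(1, min(14, value))          # assumption: never go below 1


def sjdb_overhang(read_lengths, percentile=100):
    """maxReadLength - 1, or (percentile of read length) - 1 when rare
    long reads should be ignored (e.g. percentile=90)."""
    ordered = sorted(read_lengths)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))  # nearest rank
    return ordered[rank - 1] - 1


if __name__ == "__main__":
    # Hypothetical reference: four short "chromosomes" of 75 bases each.
    print(genome_sa_index_nbases([75, 75, 75, 75]))
    # Hypothetical read lengths spanning 20 to 30 bases.
    print(sjdb_overhang(range(20, 31)))
    print(sjdb_overhang(range(20, 31), percentile=90))
```

For a reference of a few hundred bases the formula gives a much smaller value than 14, which is why it matters for tiny references.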

  • --seedSearchStartLmaxOverLread      set to "0.50"  - since my reads are so short (e.g. 25), I want the seed to come from the middle of the read , rather than at position "50" as given by default by the parameter "seedSearchStartLmax"
--seedSearchStartLmax works as follows: STAR will "split" each read into pieces of equal size, none longer than seedSearchStartLmax. With --seedSearchStartLmaxOverLread 0.5, each read is split in half.
  • --seedSearchStartLmax       should I set this to, say, 10,  since my reads are ~25 bp long?  The default is 50, which is longer than all of my reads.
If you set --seedSearchStartLmax 10, this will split each read into pieces no longer than 10. I think this results in a more "equalized" mapping accuracy for reads of different lengths.
  • Should I set "seedSearchStartLmax" or "seedSearchStartLmaxOverLread"? Should I set both?
If you set both options, the shorter of the two resulting values will be used for each read.
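
To see how the two seed options interact, here is a small Python sketch of the behaviour described above. It is not STAR's actual code: the function name `seed_pieces` is hypothetical, the default of 1.0 for seedSearchStartLmaxOverLread is an assumption, and the way leftover bases are spread over the pieces is just one reasonable choice for reads that do not divide evenly.

```python
import math


def seed_pieces(read_length, lmax=50, lmax_over_lread=1.0):
    """Return (start, end) intervals for the near-equal pieces a read
    would be split into, per the description in this thread."""
    # When both options are set, the shorter limit applies to each read.
    limit = min(lmax, lmax_over_lread * read_length)
    if limit <= 0:
        raise ValueError("seed length limit must be positive")
    n_pieces = math.ceil(read_length / limit)
    base, extra = divmod(read_length, n_pieces)
    pieces, start = [], 0
    for i in range(n_pieces):
        size = base + (1 if i < extra else 0)  # spread the remainder
        pieces.append((start, start + size))
        start += size
    return pieces


if __name__ == "__main__":
    print(seed_pieces(25))                        # limit 25: one piece
    print(seed_pieces(24, lmax_over_lread=0.5))   # split in half
    print(seed_pieces(25, lmax=10))               # pieces no longer than 10
```

With a 25-base read and the default limit of 50, nothing is split at all, which is why one of the two options has to be lowered for reads this short.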

  • --outFilterMultimapNmax     increased to a very large number (hundreds, even a thousand)  since very short reads are prone to having many mapping locations when the reference is large.
This is a filtering parameter, and it serves mostly to limit the file size when multi-mappers map to too many loci. If you are planning to use such multi-mappers, you need to increase this number. Another parameter controls STAR's ability to detect multi-mappers: --winAnchorMultimapNmax (50 by default). You will need to increase it to the number of multi-mapping locations you want to output. This may significantly slow down the mapping.
  • --seedPerWindowNmax       should I increase this if I have a short reference and expect many reads to multimap on it?
This does not have to be increased because of multi-mappers, since in each window only different seeds of a read have to be stitched.
  • --winBinNbits      should I decrease this when  aligning reads against a very small reference (< 1 kbp) where I expect the reads to multimap many times along that short reference?
I would recommend using --alignIntronMax to define the maximum intron size allowed. This will automatically set winBinNbits, winAnchorDistNbins, and winFlankNbins, which together define the window sizes.
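
Putting the alignment-step answers together, a mapping command might be assembled roughly as follows. This Python sketch only builds the argument list and does not run STAR. The helper name `star_align_cmd`, the paths, and all numeric values are placeholders chosen for illustration, not recommendations. Following the advice above, --winAnchorMultimapNmax is raised together with --outFilterMultimapNmax.

```python
import shlex


def star_align_cmd(genome_dir, reads_fastq, seed_lmax=None,
                   seed_lmax_over_lread=None, multimap_nmax=None,
                   align_intron_max=None):
    """Build a STAR command line (list of arguments) for single-end reads.
    Options left as None are omitted, so STAR's defaults apply."""
    cmd = ["STAR", "--genomeDir", genome_dir, "--readFilesIn", reads_fastq]
    if seed_lmax is not None:
        cmd += ["--seedSearchStartLmax", str(seed_lmax)]
    if seed_lmax_over_lread is not None:
        cmd += ["--seedSearchStartLmaxOverLread", str(seed_lmax_over_lread)]
    if multimap_nmax is not None:
        # Raise the output filter and the multi-mapper detection limit together.
        cmd += ["--outFilterMultimapNmax", str(multimap_nmax),
                "--winAnchorMultimapNmax", str(multimap_nmax)]
    if align_intron_max is not None:
        # Window-size parameters are then derived from this value.
        cmd += ["--alignIntronMax", str(align_intron_max)]
    return cmd


if __name__ == "__main__":
    # Placeholder paths and values, for illustration only.
    cmd = star_align_cmd("star_index", "reads.fastq", seed_lmax=10,
                         multimap_nmax=1000, align_intron_max=100)
    print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run` once the placeholder values have been replaced.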