parameters changes for very short read case

Matthew

Sep 25, 2014, 5:30:56 PM

Dear Alex,

Thank you for being so active and attentive in maintaining this help forum. It really makes STAR worth using.

I find myself in one of the "corners" of alignment issues: all of my reads are very short: ranging from 20 to 30 in length

Also, sometimes I align against large genomes (e.g. human), but sometimes against smaller ones, e.g. specific genes/operons/families of genes. Also, families of short genes (length < 100)

Note, all my reads are single-end reads.

Many of my reads get dumped as "unmapped - too short". Sometimes nearly all of them are dumped as such.

I have been trying to compile a list of STAR parameters that I should change, and would like to hear your comments or any suggestions about ones I've left out:

Genome Index step:

--genomeSAindexNbases set to min(14, log2(ReferenceLength)/2 - 1), as you suggested here.
(Reference length = sum of all chromosome lengths, right? Or is it, say, length of shortest chromosome - a concern of mine because sometimes my reference "chromosomes" are e.g. length 75 genes).
--sjdbOverhang how do I best set this if my reads vary in length (20-30), but are very short?

Alignment step:

--seedSearchStartLmaxOverLread set to "0.50" - since my reads are so short (e.g. 25), I want the seed to come from the middle of the read , rather than at position "50" as given by default by the parameter "seedSearchStartLmax"
--seedSearchStartLmax should I set this to, say, 10, since my reads are ~25 bp long? The default is 50, which is longer than all of my reads.
should I set "seedSearchStartLmax" or "seedSearchStartLmaxOverLread" ? should i do both?
--outFilterMultimapNmax increased to a very large number (hundreds, even a thousand) since very short reads are prone to having many mapping locations when the reference is large.
--seedPerWindowNmax should I increase this if I have a short reference and expect many reads to multimap on it?
--winBinNbits should I decrease this when aligning reads against a very small reference (< 1 kbp) where I expect the reads to multimap many times along that short reference?

Are there any other key parameters that I am missing that would get more of my many very-short-reads to map?

Thank you,

Matthew

Alexander Dobin

Sep 30, 2014, 12:34:29 PM

to rna-...@googlegroups.com

Hi Matthew,

please find my answers below, after each of your quoted questions.

Cheers

Alex

--genomeSAindexNbases set to min(14, log2(ReferenceLength)/2 - 1), as you suggested here.
(Reference length = sum of all chromosome lengths, right? Or is it, say, length of shortest chromosome - a concern of mine because sometimes my reference "chromosomes" are e.g. length 75 genes).

reference length = sum of all chromosome lengths
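As a rough illustration of the formula quoted above, here is a small Python sketch that computes the value from a list of chromosome lengths. The function name is made up for this example, and rounding down to an integer and clamping to at least 1 are my assumptions; the thread only gives the formula itself.

```python
import math

def genome_sa_index_nbases(chrom_lengths):
    """Hypothetical helper: min(14, log2(ReferenceLength)/2 - 1).

    ReferenceLength is the SUM of all chromosome lengths, as stated in the reply.
    Rounding down and clamping to >= 1 are assumptions, not taken from the thread.
    """
    total = sum(chrom_lengths)
    if total <= 0:
        raise ValueError("reference length must be positive")
    value = math.floor(math.log2(total) / 2 - 1)
    return max(1, min(14, value))

# Ten 75-base "chromosomes": total length 750, not 75
print(genome_sa_index_nbases([75] * 10))
# A human-sized reference is capped at 14
print(genome_sa_index_nbases([3_100_000_000]))
```

The resulting number would then be passed to --genomeSAindexNbases at the genome generation step.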

--sjdbOverhang how do I best set this if my reads vary in length (20-30), but are very short?

the best value sjdbOverhang=maxReadLength-1. If you have rare long reads, you can use something like 90-percentile.
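A minimal sketch of that rule in Python, assuming the read lengths have already been collected into a list. The helper name and the nearest-rank percentile definition are my own choices for illustration.

```python
def sjdb_overhang(read_lengths, percentile=100):
    """Hypothetical helper: (chosen read length) - 1.

    percentile=100 uses the maximum read length; an integer such as 90 uses the
    90th percentile instead (nearest-rank definition, an assumption) so that
    rare long reads do not dominate.
    """
    lengths = sorted(read_lengths)
    if not lengths:
        raise ValueError("no read lengths given")
    # Integer ceiling of percentile/100 * n, avoiding floating-point rounding
    rank = max(1, (percentile * len(lengths) + 99) // 100)
    return lengths[rank - 1] - 1

# Reads of 20-30 bases: overhang is 30 - 1
print(sjdb_overhang([20, 22, 25, 28, 30]))
# Mostly 25-base reads with a few rare 100-base reads, using the 90th percentile
print(sjdb_overhang([25] * 95 + [100] * 5, percentile=90))
```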

--seedSearchStartLmaxOverLread set to "0.50" - since my reads are so short (e.g. 25), I want the seed to come from the middle of the read , rather than at position "50" as given by default by the parameter "seedSearchStartLmax"

--seedSearchStartLmax works as follows - STAR will "split" the reads into pieces of equal size, but not longer than seedSearchStartLmax. With seedSearchStartLmaxOverLread=0.5 you will split each read in half.

--seedSearchStartLmax should I set this to, say, 10, since my reads are ~25 bp long? The default is 50, which is longer than all of my reads.

If you set --seedSearchStartLmax 10, this will split each read into pieces no longer than 10 bases. I think this results in a more "equalized" mapping accuracy for reads of different lengths.

should I set "seedSearchStartLmax" or "seedSearchStartLmaxOverLread" ? should i do both?

If you set both options, the shorter of the two values will be used for each read.
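To make the interaction of the two options concrete, here is a toy Python model of the splitting as described above. It is only a sketch of the description in this thread, not STAR's actual implementation: the function name, the rounding of the cap, and the way leftover bases are distributed are all assumptions. The defaults (50 and 1.0) are the STAR defaults for the two parameters.

```python
import math

def seed_pieces(read_length, lmax=50, lmax_over_lread=1.0):
    """Toy model (an assumption, not STAR's code) of how a read is split.

    The maximum piece length is the shorter of seedSearchStartLmax and
    seedSearchStartLmaxOverLread * read_length; the read is then cut into
    roughly equal pieces no longer than that.
    """
    cap = min(lmax, lmax_over_lread * read_length)
    cap = max(1, int(cap))                      # rounding down is an assumption
    n_pieces = math.ceil(read_length / cap)
    base, extra = divmod(read_length, n_pieces)
    # Spread any leftover bases over the first pieces
    return [base + 1 if i < extra else base for i in range(n_pieces)]

print(seed_pieces(30))                                    # defaults: one piece
print(seed_pieces(30, lmax_over_lread=0.5))               # split in half
print(seed_pieces(25, lmax=10))                           # pieces of at most 10
print(seed_pieces(30, lmax=10, lmax_over_lread=0.5))      # shorter cap (10) wins
```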

--outFilterMultimapNmax increased to a very large number (hundreds, even a thousand) since very short reads are prone to having many mapping locations when the reference is large.

This is a filtering parameter, and it serves mostly to limit the file size in case of multi-mappers mapping to too many loci. If you are planning to use such multi-mappers, you need to increase this number. There is another parameter that controls STAR's ability to detect multi-mappers: --winAnchorMultimapNmax (50 by default). You will need to increase it to the number of multi-mapping locations you want to output. This may significantly slow down the mapping.

--seedPerWindowNmax should I increase this if I have a short reference and expect many reads to multimap on it?

This does not have to be increased because of multi-mappers, since in each window only different seeds of a read have to be stitched.

--winBinNbits should I decrease this when aligning reads against a very small reference (< 1 kbp) where I expect the reads to multimap many times along that short reference?

I would recommend using --alignIntronMax to define the maximum intron size allowed. This will automatically set winBinNbits, winAnchorDistNbins, and winFlankNbins, which together define the window sizes.
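Putting the alignment-step answers together, here is a hedged Python sketch that assembles a STAR command line with the multi-mapper limits and --alignIntronMax discussed above. The function name, paths, and numeric values are placeholders for illustration only, not recommendations from this thread; the right values depend on your reference and reads.

```python
import shlex

def star_align_command(genome_dir, reads_fastq, multimap_nmax, align_intron_max):
    """Hypothetical helper that builds (but does not run) a STAR alignment command.

    --winAnchorMultimapNmax is set to the same value as --outFilterMultimapNmax,
    following the advice to raise it to the number of loci you want reported.
    """
    return [
        "STAR",
        "--genomeDir", genome_dir,
        "--readFilesIn", reads_fastq,
        "--outFilterMultimapNmax", str(multimap_nmax),
        "--winAnchorMultimapNmax", str(multimap_nmax),
        "--alignIntronMax", str(align_intron_max),
    ]

# Placeholder paths and values, chosen only to show the shape of the command
cmd = star_align_command("star_index", "reads.fastq", 500, 1000)
print(shlex.join(cmd))
```

Building the command as a list keeps each flag next to its value and makes it easy to hand to subprocess.run once the values have been chosen.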
