Confused about sjdbOverhang

David O'Brien

Aug 20, 2013, 2:27:30 PM
to rna-...@googlegroups.com
I have questions about the --sjdbOverhang option. In the manual it says, ideally = (mate_length - 1). I take this to mean that if my read length is 100 bp, the optimal value for this option is 99 because I could have 99 base pairs mapped to one side of a junction and 1 base pair on the other side. So --sjdbOverhang would be the maximum allowable base pairs on one side of a splice junction? What if I trim the 3' end of my reads for quality leaving reads of varying length? I'm pretty sure I have this all wrong. What value should I put for paired-end reads when all reads are 100 bp? How about when the lengths vary? What exactly is meant by mate_length?

James Blachly

Aug 20, 2013, 10:49:01 PM
to David O'Brien, rna-...@googlegroups.com
My understanding is that a value that is too large is better than one that is too small.

sjdbOverhang too long: mapping less efficient / slower (marginally)
sjdbOverhang too short: mappings could be missed


Alexander Dobin

Aug 21, 2013, 11:19:58 AM
to rna-...@googlegroups.com, David O'Brien
Hi David,

James is right: using a large enough --sjdbOverhang is safer and should not generally cause any problems with reads of varying length.
If your reads are very short, <50b, then I would strongly recommend using the optimum --sjdbOverhang = mateLength-1.
By mate length I mean the length of one of the ends of the read, i.e. it's 100 for 2x100b PE or 1x100b SE.
For longer reads you can simply use the generic --sjdbOverhang 100.
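
A minimal sketch of this rule of thumb, assuming a 50b cutoff for "very short" reads and the generic value of 100 otherwise. The helper names and file paths are hypothetical; only the STAR flags themselves (--runMode genomeGenerate, --genomeDir, --genomeFastaFiles, --sjdbGTFfile, --sjdbOverhang, --runThreadN) are real options. The command is only assembled, not executed.

```python
import shlex
from typing import List


def choose_sjdb_overhang(mate_length: int, short_cutoff: int = 50, generic: int = 100) -> int:
    """Rule of thumb from this post: mateLength-1 for very short reads, else the generic value.

    The cutoff of 50 is an assumption taken from the "<50b" wording above.
    """
    if mate_length < 2:
        raise ValueError("mate_length must be at least 2")
    if mate_length < short_cutoff:
        return mate_length - 1
    return generic


def genome_generate_command(genome_dir: str, fasta: str, gtf: str,
                            mate_length: int, threads: int = 4) -> List[str]:
    """Assemble (but do not run) a STAR genome generation command line."""
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,
        "--sjdbOverhang", str(choose_sjdb_overhang(mate_length)),
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    cmd = genome_generate_command("star_index", "genome.fa", "annotation.gtf", mate_length=36)
    print(shlex.join(cmd))
```

Since mate length is per end, 2x36b paired-end data would be passed as mate_length=36, not 72.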

It is a bit confusing because of the way I named this parameter. --sjdbOverhang <Noverhang> is only used at the genome generation step, for constructing the reference sequence out of the annotations.
Basically, the Noverhang exonic bases from the donor site and Noverhang exonic bases from the acceptor site are spliced together for each of the junctions, and these spliced sequences are added to the genome sequence.
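
The construction described here can be sketched on a toy sequence. This is only an illustration of the idea, not STAR's actual code: the function name, the 0-based coordinates, and the toy exons and intron are all assumptions, and exons shorter than the overhang are not handled.

```python
def spliced_insert(genome: str, donor_end: int, acceptor_start: int, overhang: int) -> str:
    """Join `overhang` exonic bases before the intron with `overhang` bases after it.

    donor_end      : 0-based index of the last exonic base before the intron
    acceptor_start : 0-based index of the first exonic base after the intron
    """
    if acceptor_start <= donor_end + 1:
        raise ValueError("intron must have positive length")
    left_start = max(0, donor_end + 1 - overhang)
    left = genome[left_start:donor_end + 1]                    # donor-side exonic bases
    right = genome[acceptor_start:acceptor_start + overhang]   # acceptor-side exonic bases
    return left + right


# Toy example (made-up sequences): exon1 + intron + exon2.
exon1, intron, exon2 = "ACGTACGTAC", "GTAAGTTTTCAG", "TTGACCGGTA"
toy_genome = exon1 + intron + exon2
toy_insert = spliced_insert(toy_genome,
                            donor_end=len(exon1) - 1,
                            acceptor_start=len(exon1) + len(intron),
                            overhang=4)

# A read crossing the junction matches the spliced insert but not the genome itself.
toy_read = "ACTTG"
assert toy_read in toy_insert and toy_read not in toy_genome
```

With overhang=4 the insert is 8 bases long, so a read can match it contiguously with at most 4 bases on either side of the junction. This is why a larger overhang lets longer pieces of a read align across an annotated junction.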

At the mapping stage, the reads are aligned to both genomic and spliced sequences simultaneously. If a read maps to one of the spliced sequences and crosses the "junction" in the middle of it, the coordinates of the two spliced pieces are translated back to genomic space and added to the collection of mapped pieces, which are then all "stitched" together to form the final alignment. Since in the process of "maximal mapped length" search the read is split into pieces of no longer than --seedSearchStartLmax (=50 by default) bases, even if the read (mate) is longer than --sjdbOverhang, it can still be mapped to the spliced reference, as long as --sjdbOverhang > --seedSearchStartLmax.

Cheers
Alex

grif...@gmail.com

Jan 27, 2014, 11:18:20 AM
to rna-...@googlegroups.com, David O'Brien
I know this thread is a bit old, but it addresses a question we've recently come across using STAR.

Alex, you mention that:

" Since in the process of "maximal mapped length" search the read is split into pieces of no longer than --seedSearchStartLmax (=50 by default) bases, even if the read (mate) is longer than --sjdbOverhang, it can still be mapped to the spliced reference, as long as --sjdbOverhang > --seedSearchStartLmax."

Can you please explain what kind of mapping behavior we can expect if --sjdbOverhang is LESS THAN --seedSearchStartLmax? For example, we are mapping 50nt paired-end reads to the human genome with standard Ensembl transcript annotations. As recommended, we built the genome with --sjdbOverhang 49 (MateLength - 1). However, we neglected to change --seedSearchStartLmax from the default (50). Mapping seems to have gone well, and it looks like we have plenty of reads spanning splice junctions. However, based on your earlier post, this may not have been an ideal run.

What sort of mismapping might we expect in this scenario? Similarly, what might we expect if we used slightly shorter reads (e.g. 48nt)?

Thanks so much for your help.

Alexander Dobin

Jan 29, 2014, 2:02:13 PM
to rna-...@googlegroups.com, David O'Brien
Hi,

In general, setting --sjdbOverhang to (MateLength-1) will give you the best sensitivity for detection of annotated junctions.
Reducing --seedSearchStartLmax will increase the overall sensitivity of mapping, including annotated and unannotated junctions, indels, multiple mismatches and other complicated cases. The reads are "split" into equal length blocks no longer than --seedSearchStartLmax. For reads ~50b long you can try reducing --seedSearchStartLmax to ~30, which will split 50b reads into two blocks (with default parameters no splitting occurs). The effect of this reduction will probably be weak, unless the sequencing quality is poor, or you are mapping to a divergent genome.
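
The splitting arithmetic can be sketched as follows, assuming the number of blocks is simply the smallest count that keeps every block at or below --seedSearchStartLmax. The function name is hypothetical, and STAR's internal calculation of seed start points may differ in detail for other read lengths.

```python
from typing import List


def split_read(read_length: int, seed_search_start_lmax: int = 50) -> List[int]:
    """Return block lengths for a read split into near-equal blocks no longer than the limit."""
    if read_length < 1 or seed_search_start_lmax < 1:
        raise ValueError("lengths must be positive")
    # Integer ceiling division: smallest number of blocks that respects the limit.
    n_blocks = (read_length + seed_search_start_lmax - 1) // seed_search_start_lmax
    base, extra = divmod(read_length, n_blocks)
    # Spread any remainder over the first `extra` blocks so lengths differ by at most 1.
    return [base + 1 if i < extra else base for i in range(n_blocks)]


if __name__ == "__main__":
    print(split_read(50, 50))  # default limit: a 50b read stays in one block
    print(split_read(50, 30))  # reduced limit: two blocks of 25
```

Under this assumption, a 50b read stays whole at the default limit of 50 and becomes two 25b blocks at a limit of 30, matching the two cases mentioned above.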

Cheers
Alex

grif...@gmail.com

Jan 29, 2014, 6:03:36 PM
to rna-...@googlegroups.com, David O'Brien
Thanks Alex, for the helpful reply.

If I understand this correctly, then, there is not an inherent "best practice" requirement for --sjdbOverhang > --seedSearchStartLmax, it just may result in better sensitivity for unannotated junctions and the like. As mentioned, we're mapping (50nt paired reads) to human and our sequencing is of high quality -- it seems safe to assume that we're OK with --sjdbOverhang 49 and --seedSearchStartLmax of 50 and not losing much. Is this a fair assumption?

Thanks again.

Alexander Dobin

Jan 31, 2014, 11:10:49 AM
to rna-...@googlegroups.com, David O'Brien
I think it's a fair assumption, but I would check it on a subset of samples by using, say, --seedSearchStartLmax 30. That will tell you how much you are losing.

For sjdbOverhang, I would formulate the general rule like this: ideally sjdbOverhang = readLength-1, but at the very least sjdbOverhang >= min(readLength-1, seedSearchStartLmax-1).
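
The rule above is easy to turn into a small check. This is a sketch with hypothetical function names; the three labels it returns are just one way of reading the rule, not STAR output.

```python
def ideal_overhang(read_length: int) -> int:
    """Ideal value: readLength - 1."""
    return read_length - 1


def minimum_overhang(read_length: int, seed_search_start_lmax: int = 50) -> int:
    """Lower bound: min(readLength - 1, seedSearchStartLmax - 1)."""
    return min(read_length - 1, seed_search_start_lmax - 1)


def classify_overhang(sjdb_overhang: int, read_length: int,
                      seed_search_start_lmax: int = 50) -> str:
    """Label a chosen sjdbOverhang as 'ideal', 'acceptable' or 'too small'."""
    if sjdb_overhang == ideal_overhang(read_length):
        return "ideal"
    if sjdb_overhang >= minimum_overhang(read_length, seed_search_start_lmax):
        return "acceptable"
    return "too small"


if __name__ == "__main__":
    print(classify_overhang(49, read_length=50))    # the 50nt case discussed above
    print(classify_overhang(100, read_length=150))  # generic value with longer reads
    print(classify_overhang(30, read_length=100))   # below min(99, 49) = 49
```

For the 50nt reads discussed in this exchange, with --seedSearchStartLmax left at 50, both the ideal value and the lower bound come out to 49.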

Anna Quaglieri

May 29, 2017, 10:07:08 PM
to rna-star, dunder...@gmail.com
Thanks for these clarifications. I would like to be sure of one more thing. I trimmed my reads, and they now range from 70 to 150 bp. Shall I use 149 as the sjdbOverhang?

Thanks,
Anna

Alexander Dobin

May 31, 2017, 5:36:06 PM
to rna-star, dunder...@gmail.com
Hi Anna,

--sjdbOverhang 149 for 70-150b reads is fine, but might be overkill, as the default 100 will work practically the same.

Cheers
Alex

Anna Quaglieri

Jun 7, 2017, 12:15:27 AM
to rna-star, dunder...@gmail.com
Thanks a lot Alex!

Anna

kevin.pan

Jun 8, 2017, 12:53:45 AM
to rna-star, dunder...@gmail.com
Hi Alex,
I can sort of find my answer by reading the posts above, but I am not sure, so I want to confirm with you.
I have samples from different datasets (all paired-end) with read lengths of 48, 70, 101, 110, 140, etc.
What I want to do is create an index with --sjdbOverhang 100 for the 48, 70 and 101 datasets and an index with --sjdbOverhang 140 for the 110 and 140 datasets. Do you think this is correct? Do I need a 47 index for the samples with read length 48?
Based on your explanation above, it seems that if I don't care too much about efficiency, a longer sjdbOverhang is safer, which means I could even use a 140 index for all samples?

Thanks

Alexander Dobin

Jun 9, 2017, 5:11:29 PM
to rna-star, dunder...@gmail.com
Hi Kevin,

Generally, I would recommend keeping it at the default value of 100 for all samples.
Just to make sure it does not introduce a big difference, you could make 47 and 139 indexes and check one sample for each.

Cheers
Alex

Trevor Tanner

May 29, 2018, 10:44:36 AM
to rna-star
Hi Alex,

I'm in a similar boat to the previous posters, but have the special case that you explicitly mentioned of very short read data (the samples have different lengths too, of course: 25bp, 31bp, 33bp, and 50bp). Would it be wise to just make individual indices for every read length (the majority are 25bp), or is it still relatively safe to go with the longest read length (50bp) and make a single index for all of them?

Thank you for your time.

Thanks again,
Trevor

Alexander Dobin

May 30, 2018, 4:02:38 PM
to rna-star
Hi Trevor,

I would recommend making 24b and 49b indexes, mapping the 25b datasets to both, and comparing the results: in particular, the number of splices in the Log.final.out, or, better, the unique and multiple read counts per junction in the SJ.out.tab.
If the difference is small, it would be safe to use the index built for the 50b reads (--sjdbOverhang 49) for all lengths.
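
One way to make this comparison is a short script over the two SJ.out.tab files. The sketch below assumes the standard tab-separated layout (column 7 = uniquely mapping reads crossing the junction, column 8 = multi-mapping reads); the function names are hypothetical and the two example "runs" are made-up lines, not real results.

```python
from typing import Dict, Iterable, List, Tuple

Junction = Tuple[str, int, int, int]   # chromosome, intron start, intron end, strand code
Counts = Tuple[int, int]               # (unique reads, multi-mapping reads)


def read_sj(lines: Iterable[str]) -> Dict[Junction, Counts]:
    """Parse SJ.out.tab lines into {junction: (unique, multi)}."""
    table: Dict[Junction, Counts] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip blank or malformed lines
        key = (fields[0], int(fields[1]), int(fields[2]), int(fields[3]))
        table[key] = (int(fields[6]), int(fields[7]))
    return table


def totals(table: Dict[Junction, Counts]) -> Counts:
    """Sum unique and multi-mapping junction reads over all junctions."""
    return (sum(u for u, _ in table.values()), sum(m for _, m in table.values()))


def differing_junctions(a: Dict[Junction, Counts],
                        b: Dict[Junction, Counts]) -> List[Tuple[Junction, Counts, Counts]]:
    """List junctions whose counts differ between two runs (missing junction = (0, 0))."""
    out = []
    for key in sorted(set(a) | set(b)):
        ca, cb = a.get(key, (0, 0)), b.get(key, (0, 0))
        if ca != cb:
            out.append((key, ca, cb))
    return out


# Made-up example lines standing in for two runs of the same sample.
run_24 = ["chr1\t100\t200\t1\t1\t1\t10\t2\t20", "chr1\t300\t400\t1\t1\t1\t5\t0\t18"]
run_49 = ["chr1\t100\t200\t1\t1\t1\t10\t2\t20", "chr1\t300\t400\t1\t1\t1\t3\t1\t18"]

sj_24, sj_49 = read_sj(run_24), read_sj(run_49)
```

With real data, the lists would be replaced by open file handles, e.g. `with open(path) as fh: read_sj(fh)`, where `path` points at each run's SJ.out.tab.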

Cheers
Alex