What is the difference between alignSJDBoverhangMin and outSJfilterOverhangMin?

Sasa Kornienko

Mar 12, 2013, 7:42:06 AM
to rna-...@googlegroups.com
Dear all!

Could someone please explain to me (as an almost complete dummy in alignment details) the difference between these two options:
> alignSJDBoverhangMin
from the manual: int>0: minimum overhang (i.e. block size) for annotated (sjdb) spliced alignments

AND

> outSJfilterOverhangMin
from the manual: minimum overhang length for splice junctions on both sides for: (1) non-canonical motifs, (2) GT/AG motif, (3) GC/AG motif, (4) AT/AC motif. -1 means no output for that motif. Does not apply to annotated junctions.

Why is the default alignSJDBoverhangMin = 3, while the default outSJfilterOverhangMin is so big (30 12 12 12)?

Are these the same notion of overhang? As I understand it, the overhang is the length of the piece of the read that is split off by the splice. So with alignSJDBoverhangMin = 3, a 100bp read could align as 97bp in one place and 3bp in another. Why, then, is this piece not allowed to be that small in outSJfilterOverhangMin?

Sorry if this is obvious; I just got completely confused when choosing these important parameters for the alignment.

Thanks a lot!!

Alexandra

Alexander Dobin

Mar 12, 2013, 5:17:52 PM
to rna-...@googlegroups.com
Hi Alexandra,

this is a good, non-obvious question - I need to work on the manual and F.A.Q.
--alignSJDBoverhangMin (and similarly --alignSJoverhangMin for unannotated junctions) prohibits alignments with very small splice overhangs. This is done for each read independently of all other reads.
The alignments with smaller overhangs never make it to the Aligned.out.sam file.
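
To make the per-read check concrete, here is a minimal Python sketch. It is not STAR's actual code: the function names are made up for illustration, and it assumes that the overhang of a spliced read is the shorter of the two blocks on either side of the junction. The thresholds are the defaults quoted in this thread.

```python
# Illustrative sketch only; this is not STAR's source code.
# Thresholds are the default values quoted in this thread.
ALIGN_SJDB_OVERHANG_MIN = 3  # --alignSJDBoverhangMin (annotated junctions)
ALIGN_SJ_OVERHANG_MIN = 5    # --alignSJoverhangMin (unannotated junctions)


def splice_overhang(left_block: int, right_block: int) -> int:
    """Assumption: the overhang is the shorter of the two blocks flanking the splice."""
    return min(left_block, right_block)


def read_passes(left_block: int, right_block: int, annotated: bool) -> bool:
    """Per-read check, made independently of every other read."""
    minimum = ALIGN_SJDB_OVERHANG_MIN if annotated else ALIGN_SJ_OVERHANG_MIN
    return splice_overhang(left_block, right_block) >= minimum


# A 100 bp read split 97 + 3 across a junction:
print(read_passes(97, 3, annotated=True))   # True: 3 >= 3
print(read_passes(97, 3, annotated=False))  # False: 3 < 5
```

In this sketch the 97 + 3 split from the question survives only when the junction is annotated; across an unannotated junction the same split would be rejected.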

On the other hand, --outSJfilterOverhangMin (and other outSJ* parameters) controls the output of "collapsed" junctions into the SJ.out.tab.
"Collapsing" means that the information is collected from spliced reads that support a particular junction (intron) in the genome.
While there are many millions of spliced reads, there will be only a few hundred thousand collapsed junctions (i.e. the number of lines in SJ.out.tab).
The last column of the SJ.out.tab file reports the maximum overhang over all the reads that support each junction. 
The --outSJfilterOverhangMin parameter keeps out of SJ.out.tab those junctions for which all supporting reads have overhangs smaller than 30/12/12/12 (depending on the intron motif).
This parameter, together with others such as --outSJfilterCountUniqueMin, --outSJfilterCountTotalMin, --outSJfilterDistToOtherSJmin and --outSJfilterIntronMaxVsReadN, makes it possible to create a highly confident set of detected splice junctions in SJ.out.tab.
Note that if you are using annotations (sjdb), only unannotated junctions are affected by these filtering parameters; all annotated junctions are output into SJ.out.tab without filtering.
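
The collapsing and overhang filtering described above can be sketched in a few lines of Python. This is only an illustration, not STAR's implementation: the function names, the tuple layout and the coordinates are hypothetical, the count is a plain read count rather than the separate unique/multi-mapping counts, and the other --outSJfilter* parameters are ignored.

```python
from collections import defaultdict

# Illustrative sketch only; not STAR's implementation.
# --outSJfilterOverhangMin 30 12 12 12, keyed by motif class.
OUT_SJ_FILTER_OVERHANG_MIN = {
    "non-canonical": 30,
    "GT/AG": 12,
    "GC/AG": 12,
    "AT/AC": 12,
}


def collapse(spliced_reads):
    """Group reads by junction, keeping the read count and the maximum overhang."""
    junctions = defaultdict(lambda: {"count": 0, "max_overhang": 0})
    for junction, overhang in spliced_reads:
        entry = junctions[junction]
        entry["count"] += 1
        entry["max_overhang"] = max(entry["max_overhang"], overhang)
    return dict(junctions)


def junction_is_output(max_overhang, motif, annotated):
    """Annotated junctions bypass the filter; others need one long-enough overhang."""
    if annotated:
        return True
    threshold = OUT_SJ_FILTER_OVERHANG_MIN[motif]
    if threshold < 0:  # -1 means no output for that motif
        return False
    return max_overhang >= threshold


# Hypothetical reads: (junction as (chrom, intron start, intron end), overhang)
reads = [
    (("chr1", 1000, 2000), 8),
    (("chr1", 1000, 2000), 20),
    (("chr1", 5000, 7000), 9),
]
for junction, stats in collapse(reads).items():
    kept = junction_is_output(stats["max_overhang"], "GT/AG", annotated=False)
    print(junction, stats, "kept" if kept else "filtered out")
```

The point of the sketch is that the filter looks at the best-supported read for each junction, not at each read individually: one read with a long overhang is enough to keep the junction.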

By default, the --outSJfilter* parameters do not affect alignments in the Aligned.out.sam file.
However, if you use the --outFilterType BySJout option, splicing of alignments in Aligned.out.sam will only be allowed across the junctions which pass the filtering into SJ.out.tab.
This option makes the Aligned.out.sam file consistent with the SJ.out.tab file.

It probably sounds a bit cumbersome, but the logic is simple, I believe:
First, we filter out all the alignments with very short overhangs, --alignSJDBoverhangMin 3 --alignSJoverhangMin 5.
Next, we create a confident set of junctions by requiring that at least one supporting read has a large enough overhang >= --outSJfilterOverhangMin 30 12 12 12 (i.e. 12 for unannotated canonical motifs, or 30 for non-canonical).
Finally, with --outFilterType BySJout we prohibit any alignments across the junctions that did not make it into the confident set.

As an example, imagine that you have an unannotated GT/AG intron that is crossed by three uniquely mapped spliced reads, with overhangs 6, 9 and 12.
This intron will make it into the SJ.out.tab file with a "unique read count" (col 7) of 3 and a max overhang (col 9) of 12. Also, all three splices will be reported in the Aligned.out.sam file.
If the three overhangs were 6, 9 and 11, the intron would not make it into SJ.out.tab, but by default the spliced alignments would still be reported in Aligned.out.sam.
However, if you used --outFilterType BySJout, those splices would not be allowed in Aligned.out.sam, and their alignments would most likely contain soft clipping (S) instead of splices.
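
The example can also be traced with a small Python sketch. Again this is only an illustration under simplifying assumptions, not what STAR does internally: `summarize` and its dictionary keys are made-up names, only the overhang filter for a single unannotated GT/AG junction is modelled, and every read is assumed to have already passed the per-read --alignSJoverhangMin check.

```python
# Illustrative sketch of the example above; not STAR's implementation.
GT_AG_OVERHANG_MIN = 12  # second value of --outSJfilterOverhangMin 30 12 12 12


def summarize(overhangs, by_sj_out=False):
    """Return what happens to one unannotated GT/AG junction and its reads."""
    max_overhang = max(overhangs)
    in_sj_out_tab = max_overhang >= GT_AG_OVERHANG_MIN
    # By default the junction filter does not touch the alignments;
    # with --outFilterType BySJout, splices need a junction that passed.
    splices_in_sam = in_sj_out_tab or not by_sj_out
    return {
        "unique_read_count": len(overhangs),
        "max_overhang": max_overhang,
        "in_SJ.out.tab": in_sj_out_tab,
        "splices_in_Aligned.out.sam": splices_in_sam,
    }


print(summarize([6, 9, 12]))
print(summarize([6, 9, 11]))
print(summarize([6, 9, 11], by_sj_out=True))
```

Only the last call drops the splices from the alignments, which is the case where the affected reads would most likely end up soft-clipped.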

Hope this made it a bit clearer,
Cheers
Alex