alignSJDBoverhangMin parameter


Varun Gupta

Dec 20, 2017, 6:01:16 PM
to rna-star
Hi Alex,
Hope you are doing well.

I was looking at splicing in my genes of interest and trying to identify novel splice junctions occurring in tissues. I have run into the following situation.

By default:
--alignSJoverhangMin  5 (for novel junctions)
--alignSJDBoverhangMin   3 (for junctions annotated in the GTF or a provided file)

1. In some genes, when --alignSJDBoverhangMin is 3, I get alignments that are clearly false.
 For example:

ERR315335.198824    163    chr4    109541754    255    3M1358N66M80N32M    =    109543747    4603    CAGAATGGTCCAGCGTTTGACATACCGACGTAGGCTTTCCTACAATACAGCCTCTAACAAAACTAGGCTGTCCCGAACCCCTGGTAATAGAATTGTTTACC    BBBBBFFFFFFFFIIIFFFBFFIIIIIBFFFFFFIIIFIIFIIIBFFFFIIIFIFFIFFFFFFFFFFFBFBBBBB<BFFBFFFFFBFFFFFBFFBBBBFB<    NH:i:1    HI:i:1    AS:i:205    nM:i:0    NM:i:0    MD:Z:101    jM:B:c,21,21    jI:B:i,109541757,109543114,109543181,109543260    XS:A:+
ERR315335.2867532    99    chr4    109541754    255    3M1358N66M80N32M    =    109543691    4547    CAGAATGGTCCAGCGTTTGACATACCGACGTAGGCTTTCCCACAATACAGCCTCTAACAAAACTAGGCTGTCCCGAACCCCTGGTAATAGAATTGTTTACC    BBBFFFFFFFFFFIIFIIIIIIIIIIIIIIIIIFFFIIII0BFIIIIIIIIIIIIFFFFFFFFFFFFFFFFFFFFFFFFFFBFF<BFFFFFFFFFBFFFFB    NH:i:1    HI:i:1    AS:i:203    nM:i:1    NM:i:1    MD:Z:40T60    jM:B:c,21,21    jI:B:i,109541757,109543114,109543181,109543260    XS:A:+

The reads here show a CIGAR string of 3M before the splice, followed by 66M. The 3 bases in the 3M block are CAG, and the 3 bases immediately before the 66M block are also CAG. My question is why STAR introduces a gap when aligning this read instead of reporting a continuous alignment. This junction is annotated in my database, and as soon as I changed --alignSJDBoverhangMin to 5, the reads aligned as 69M80N32M. Even with a database overhang of 3, why does STAR prefer a gap (I would not call it a splice, because it is clearly noise) over a continuous alignment? Is there a gap-penalty parameter I can adjust to see whether that fixes it?
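
For reference, the intron coordinates in the jI attribute of the records above can be cross-checked against POS and the CIGAR string. The Python sketch below is only an illustration: the function name is hypothetical, and it assumes 1-based inclusive intron start and end coordinates.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def intron_coordinates(pos, cigar):
    """Return [(start, end), ...] for every N operation (1-based, inclusive)."""
    introns = []
    ref = pos  # reference position of the next base to be consumed
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            introns.append((ref, ref + length - 1))
        if op in "MDN=X":  # operations that consume reference bases
            ref += length
    return introns

print(intron_coordinates(109541754, "3M1358N66M80N32M"))
# [(109541757, 109543114), (109543181, 109543260)]
```

The result matches jI:B:i,109541757,109543114,109543181,109543260, so the 3M block sits directly upstream of the first reported intron.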


2. Do the overhang parameters I mentioned above apply only to terminal overhangs? For example, what do you make of this CIGAR string, which I still see when --alignSJDBoverhangMin is 5:

 84M134N3M2762N14M

The 3M shown above is my 3-nt exon (which is real, confirmed by RT-PCR). As I understand it, an overhang is the part of an exon covered by the read. With --alignSJDBoverhangMin set to 5, I still get reads that have 3M for the 3-nt exon and splice on to other exons. So does --alignSJDBoverhangMin consider only the terminal overhangs, such as 84M and 14M, so that a small exon (as in my case) is still found even when the overhang parameter is 5? And if the terminal overhangs 84M and 14M were shorter than 5, would the alignment not be reported? Am I correct on this?

In problem 2 of my previous post, I was not using an annotated GTF file, annotated junctions, or the 2-pass method. I have since started using all of them, and I still get reads such as 47M660N18M10S. Even when running in 2-pass mode with the junction annotated, the read is still reported as soft-clipped. Is there any fix for that?

Thanks a lot for all your help.

Regards
Varun

Alexander Dobin

Dec 22, 2017, 2:13:16 PM
to rna-star
Hi Varun,

1. There is a parameter that controls the scoring of annotated junctions, --sjdbScore, which is 2 by default.
So indeed, alignments spliced across annotated junctions are preferred over continuous alignments. This is done mostly to reduce biased mapping to pseudogenes.
If you reduce this parameter to 0, your read will become a multimapper, with both the spliced and the continuous alignment reported.
Short splice overhangs are always somewhat suspicious; I think the best approach is to filter them after mapping.
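
As a rough illustration of such post-mapping filtering, the Python sketch below flags alignments whose first or last spliced block is shorter than a chosen threshold. The function names and the threshold of 5 are assumptions made for illustration and are not part of STAR; only M, = and X operations are counted as aligned bases, so soft clips and indels do not add to an overhang.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def spliced_blocks(cigar):
    """Aligned read bases in each block delimited by N operations."""
    blocks = [0]
    for length, op in CIGAR_OP.findall(cigar):
        if op == "N":
            blocks.append(0)          # an intron starts a new block
        elif op in "M=X":
            blocks[-1] += int(length)  # aligned bases extend the current block
    return blocks

def has_short_terminal_overhang(cigar, min_overhang=5):
    """True if a spliced alignment starts or ends with a block < min_overhang."""
    blocks = spliced_blocks(cigar)
    if len(blocks) < 2:               # unspliced: no overhang to check
        return False
    return blocks[0] < min_overhang or blocks[-1] < min_overhang

print(has_short_terminal_overhang("3M1358N66M80N32M"))  # True
print(has_short_terminal_overhang("69M80N32M"))         # False
```

A filter like this could be applied to the CIGAR column of each SAM record, dropping or tagging the flagged reads before junction counting.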

2. --alignSJDBoverhangMin indeed does not apply to a microexon surrounded by two annotated junctions.
It will apply if one of the junctions is unannotated (and so will --alignSJoverhangMin of course).
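
To see which internal blocks of a read are flanked by two annotated junctions, the CIGAR string can be combined with STAR's jM attribute, in which 20 is added to the motif code of an annotated junction. The Python sketch below is only an illustration: the function names are hypothetical, and the jM value passed for the microexon read is an assumed example, since the original post does not show that read's attributes.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def block_lengths(cigar):
    """Aligned read bases in each block delimited by N operations."""
    blocks = [0]
    for length, op in CIGAR_OP.findall(cigar):
        if op == "N":
            blocks.append(0)
        elif op in "M=X":
            blocks[-1] += int(length)
    return blocks

def annotated_flags(jm_tag):
    """Parse e.g. 'jM:B:c,21,21' into [True, True] (value >= 20 means annotated)."""
    return [int(v) >= 20 for v in jm_tag.split(",")[1:]]

def internal_blocks(cigar, jm_tag):
    """List (length, both_flanking_junctions_annotated) for each internal block."""
    blocks = block_lengths(cigar)
    annotated = annotated_flags(jm_tag)
    return [(blocks[i], annotated[i - 1] and annotated[i])
            for i in range(1, len(blocks) - 1)]

# The jM value here is assumed for illustration, not taken from the thread.
print(internal_blocks("84M134N3M2762N14M", "jM:B:c,21,21"))
# [(3, True)]
```

Under that assumed jM value, the 3-nt block is internal and both of its junctions are annotated, which is the microexon case described in point 2.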

3. If you are getting soft-clipping at the ends of the reads, it means that (i) there is no annotated junction to support the soft-clipped sequence; OR (ii) there is a mismatch/indel STAR cannot map through. If you send me a specific example, I can tell you which it was.
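
To collect specific examples, soft-clipped alignments can be picked out by their CIGAR strings. The Python sketch below uses a hypothetical helper that reports the leading and trailing soft-clip lengths; it only locates the clipped reads and does not diagnose which of the two causes applies.

```python
import re

def soft_clips(cigar):
    """Return (leading, trailing) soft-clip lengths; 0 where there is none."""
    # A soft clip may be preceded (or followed) by a hard clip, e.g. 5H10S...
    lead = re.match(r"(?:\d+H)?(\d+)S", cigar)
    trail = re.search(r"(\d+)S(?:\d+H)?$", cigar)
    return (int(lead.group(1)) if lead else 0,
            int(trail.group(1)) if trail else 0)

print(soft_clips("47M660N18M10S"))  # (0, 10)
```

For 47M660N18M10S, the last 10 bases of the read sequence are the clipped part.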

Cheers
Alex