Hi Alex,
Hope you are doing well.
I have been looking at splicing in my genes of interest, trying to identify novel splice junctions in tissues, and I have run into a few issues.
By default:
--alignSJoverhangMin 5 (for novel splice junctions)
--alignSJDBoverhangMin 3 (for junctions annotated in the GTF or a provided file)
1. For some genes, with alignSJDBoverhangMin at 3, I am getting alignments that are clearly false.
For example:
ERR315335.198824 163 chr4 109541754 255
3M1358N66M80N32M = 109543747 4603 CAGAATGGTCCAGCGTTTGACATACCGACGTAGGCTTTCCTACAATACAGCCTCTAACAAAACTAGGCTGTCCCGAACCCCTGGTAATAGAATTGTTTACC BBBBBFFFFFFFFIIIFFFBFFIIIIIBFFFFFFIIIFIIFIIIBFFFFIIIFIFFIFFFFFFFFFFFBFBBBBB<BFFBFFFFFBFFFFFBFFBBBBFB< NH:i:1 HI:i:1 AS:i:205 nM:i:0 NM:i:0 MD:Z:101 jM:B:c,21,21 jI:B:i,109541757,109543114,109543181,109543260 XS:A:+
ERR315335.2867532 99 chr4 109541754 255
3M1358N66M80N32M = 109543691 4547 CAGAATGGTCCAGCGTTTGACATACCGACGTAGGCTTTCCCACAATACAGCCTCTAACAAAACTAGGCTGTCCCGAACCCCTGGTAATAGAATTGTTTACC BBBFFFFFFFFFFIIFIIIIIIIIIIIIIIIIIFFFIIII0BFIIIIIIIIIIIIFFFFFFFFFFFFFFFFFFFFFFFFFFBFF<BFFFFFFFFFBFFFFB NH:i:1 HI:i:1 AS:i:203 nM:i:1 NM:i:1 MD:Z:40T60 jM:B:c,21,21 jI:B:i,109541757,109543114,109543181,109543260 XS:A:+
The CIGAR string for these reads shows 3M before the first gap, followed by 66M. The 3M bases are CAG, and the 3 genomic bases immediately upstream of the 66M block are also CAG. Why does STAR introduce a gap when aligning this read rather than producing a continuous alignment? This junction is annotated in my database, and the moment I changed --alignSJDBoverhangMin to 5, the reads started to align as 69M80N32M.
Even with the database overhang at 3, why does STAR prefer a gap (I would not call it a splice, because it is clearly noise) over a continuous alignment? Is there a gap penalty parameter I can adjust to see whether that fixes it?
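To double-check the coordinates, here is a minimal Python sketch that walks a CIGAR string from the SAM POS and reports each N gap as a 1-based inclusive interval. The function names are my own and not part of STAR. For the reads above it reproduces the values in the jI tag, and it shows that the contiguous 69M80N32M alignment would have to start 3 bases upstream of the second block.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, op) tuples."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]

def introns(pos, cigar):
    """Return 1-based inclusive (start, end) for every N gap, given the SAM POS."""
    ref = pos
    gaps = []
    for length, op in parse_cigar(cigar):
        if op == "N":
            gaps.append((ref, ref + length - 1))
        if op in "MDN=X":  # operations that consume reference bases
            ref += length
    return gaps

# Gapped alignment reported with alignSJDBoverhangMin 3
print(introns(109541754, "3M1358N66M80N32M"))
# Contiguous alternative: the 66M block starts at 109543115, so 69M starts at 109543112
print(introns(109543112, "69M80N32M"))
```

The first call gives both junctions listed in jI; the second gives only the 80 nt intron, which is the alignment I see once the overhang minimum is raised to 5.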
2. Do the overhang parameters mentioned above apply only to terminal overhangs? For example, with --alignSJDBoverhangMin set to 5, I still find reads with a CIGAR string like this:
84M134N3M2762N14M
The 3M here is my 3 nt exon (it is real, confirmed by RT-PCR). As I understand it, overhang means the part of an exon covered by the read. With alignSJDBoverhangMin set to 5, I still get reads that have 3M for the 3 nt exon and splice on to other exons. So does alignSJDBoverhangMin only take into account the terminal overhangs, such as 84M and 14M, meaning that a small internal exon (as in my case) will still be found even when the overhang parameter is 5? And if the terminal overhangs 84M and 14M were shorter than 5 bases, the read would not be aligned this way. Am I correct on this?
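To make the distinction concrete, here is a small Python sketch that splits a CIGAR string into aligned blocks at each N gap and labels any block shorter than a threshold as terminal or internal. It only describes the reported alignments. It does not model how STAR applies its overhang filters, and the helper names are hypothetical.

```python
import re

def aligned_blocks(cigar):
    """Lengths of the aligned blocks separated by N gaps (M, = and X bases only)."""
    blocks = [0]
    for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op == "N":
            blocks.append(0)          # a gap starts a new block
        elif op in "M=X":
            blocks[-1] += int(n)      # soft clips, insertions, deletions are ignored
    return blocks

def short_blocks(cigar, min_len):
    """Return (position, length) for each block shorter than min_len."""
    blocks = aligned_blocks(cigar)
    last = len(blocks) - 1
    return [
        ("terminal" if i in (0, last) else "internal", length)
        for i, length in enumerate(blocks)
        if length < min_len
    ]

print(short_blocks("84M134N3M2762N14M", 5))  # the 3 nt exon in the middle of the read
print(short_blocks("3M1358N66M80N32M", 5))   # the 3 nt overhang at the read start
```

Run over a whole SAM file, this would let me count how many short blocks are internal (like my 3 nt exon) and how many are terminal (like the false 3M overhang in question 1).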
Following up on my previous post: for problem 2 there, I was not using an annotated GTF file, a junction list, or the 2-pass method. I am now using all of them and I still get reads such as 47M660N18M10S. Even when running in 2-pass mode with the junction annotated, the read is still reported as soft-clipped. Is there any fix for that?
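To find all such reads, a minimal Python sketch like the one below could report the soft-clipped length at each end of a CIGAR string. The function name is my own, and it assumes any hard clips (H) sit outside the soft clips, as the SAM format requires.

```python
import re

def soft_clips(cigar):
    """Return (left, right) soft-clipped lengths for a CIGAR string."""
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    ops = [(n, op) for n, op in ops if op != "H"]  # drop hard clips at the ends
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right

print(soft_clips("47M660N18M10S"))
```

For the read above this reports no clipping on the left and 10 clipped bases on the right.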
Thanks a lot for all your help.
Regards
Varun