Hi Alex, thanks so much for this software and for all the attention you give to its continued maintenance.
Unnecessary non-technical background: I work in a lab that uses RNA-seq to look for novel splice junctions in undiagnosed disease cases. We often observe a genomic variant that displaces one side of a canonically annotated junction but leaves the other side intact. In this case, one would expect STAR's SJ.out.tab file to list the new junction with one novel coordinate and one known coordinate. Most of the time (> 90%), STAR does this correctly, but on rare occasions it outputs novel coordinates on both the left and right sides, each off by a few bases. This is a problem for us because it makes it significantly harder to match the causal variant to the novel junction (it's much easier if we can expect one end to be a canonically annotated end).
Below and attached I have tried to isolate a minimal example for debugging purposes. Actually, I'm wondering if perhaps one of my assumptions is wrong, which I'll put up top here as a question: When using an sjdb, should I expect STAR to favor a novel junction that shares one end in common with a known annotated junction over a novel junction that shares zero ends with an annotated junction? If the answer is yes, then proceed... If the answer is no, then could that logic be added?
Details for the minimal example:
- The attached FASTQ files contain 74 read pairs.
- There is a canonically annotated splice junction at 4:674377-674880 (hg19).
- The sample contains a heterozygous variant 4:674373A>T, creating a competing donor site that sometimes results in the novel junction 4:674372-674880.
- When I align the FASTQ files (I am using STAR 2.6.0c with two-pass alignment, an sjdb, and default parameters for everything else), ideally I would see the following types of spliced reads:
  - A: reads without the variant would show the canonically annotated junction 4:674377-674880,
  - B: some reads with the variant would still splice at the canonically annotated junction 4:674377-674880,
  - C: some reads with the variant would splice at the new junction 4:674372-674880.
- In the attached alignment, I see read types A & B, but I don't see type C. Instead, STAR aligns the novel junction as D: 4:674370-674878. This is a different spelling of C: 4:674372-674880 (the same junction with both ends shifted 2 bases to the left), but without the benefit that one of the sides of the junction matches a canonically annotated junction. I was surprised by this, since I thought STAR would favor C over D because of the sjdb.
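To make concrete what I mean by "a different spelling", here is a minimal Python sketch of the tie-breaking I had in mind. It is not STAR's actual logic: the function names `equivalent_junctions` and `prefer_annotated`, the toy reference sequence, and the toy coordinates are all made up for illustration. The idea is that an intron can slide left or right without changing the read sequence whenever the base entering the intron on one side matches the base leaving it on the other, and that among those equivalent placements I would hope for the one sharing an end with an annotated junction.

```python
def equivalent_junctions(seq, start, end):
    """Return all placements of an intron that yield the same spliced sequence.

    seq   : reference sequence (string)
    start : first base of the intron, 1-based
    end   : last base of the intron, 1-based inclusive
    """
    results = [(start, end)]

    # Slide left: the base just before the intron must equal the intron's last base.
    s, e = start, end
    while s > 1 and seq[s - 2] == seq[e - 1]:
        s, e = s - 1, e - 1
        results.append((s, e))

    # Slide right: the intron's first base must equal the base just after the intron.
    s, e = start, end
    while e < len(seq) and seq[s - 1] == seq[e]:
        s, e = s + 1, e + 1
        results.append((s, e))

    return sorted(results)


def prefer_annotated(candidates, annotated):
    """Pick the candidate sharing the most ends with annotated junctions."""
    known_starts = {s for s, _ in annotated}
    known_ends = {e for _, e in annotated}
    return max(candidates, key=lambda j: (j[0] in known_starts) + (j[1] in known_ends))


# Toy example (hypothetical sequence, not hg19): exon "TTTCA", intron "GTAAAACA", exon "GGG".
toy_seq = "TTTCAGTAAAACAGGG"
candidates = equivalent_junctions(toy_seq, 6, 13)
annotated = [(9, 13)]  # hypothetical annotated junction sharing the right end
best = prefer_annotated(candidates, annotated)
print(candidates)
print(best)
```

In this toy case the intron at 6-13 can also be written as 4-11, 5-12, or 7-14, and only 6-13 shares an end with the annotated junction, which is the analogue of preferring C over D above.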
Could you let me know which is the expected behavior, and whether it is something that could potentially be fixed?
Thanks very much!
Lee-kai