High SpanningFrags & Low Junction Reads within Zero TotalDistFromExon Fusion Events

Benjamin Kellman

Apr 1, 2016, 6:51:31 PM

to STAR-Fusion

Hello,

I have been working towards parsing through this data. I've separated the fusion events into:

(1) TotalDistFromExon==0

(2) TotalDistFromExon>0

where TotalDistFromExon = RightDistFromRefExonSplice + LeftDistFromRefExonSplice.

My working hypothesis is that group 1 (fusion events occurring a total distance of zero nt from the exon splice site) will correspond to trans-splicing events while group 2 will correspond to erroneous fusions resulting from chromatin rearrangement or chromatin fusions.

I think my characterization of group 2 is reasonable, since several junctions exist in most of my subjects (a chromatin-structure-induced offset from the exon splice site) while some events are rare, which could be due to erroneous chromatin structure including chromatin fusion (attachment 3: [rightExonD - leftExonD] vs genomic locus). That said, I'm not aware of a regulated biological process that would create a gene fusion with an offset from the exon splice site. (Please correct me if I'm wrong; this is not my area.)

I'm concerned that group 1 may not be trans-splicing events. Looking at the junction reads vs the spanning reads (attachments 1 and 2), it appears that while there are many Spanning Fragments [1,300] there are very few (but never zero) Junction Reads [1,3]. How is this possible? I don't think this is caused by an amplification error, since cursory checks show no evidence of copious duplication (I'm still looking into this). My suspicion is that these differences may be caused by inconsistent junctions in close proximity between the same genes, or by indel-induced gene fusions. I'm not aware of a biological mechanism that would justify this suspicion. Looking at the example below, there are no obvious indels present in the CIGAR strings. If it is not indels, could this be differential trans-splicing? Below are the junction reads from 2 gene fusions (though the 2nd junction read may be a simple duplication):

$ awk '$2 > 1 && $3>10 && $6 < 1 && $9 < 1 {print}' star_fusion_SL42404_1.fusioncalls.fusion_candidates.txt

RP11-509J21.2--RP11-509J21.1 2 91 RP11-509J21.2^ENSG00000237359.1 chr9:3647477:+ 0 RP11-509J21.1^ENSG00000232104.2 chr9:3602523:+ 0

2 junction, 91 spanning
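The same filter can be expressed in a short script. This is only a minimal Python sketch: the column order (fusion name, junction reads, spanning frags, left gene, left breakpoint, left distance, right gene, right breakpoint, right distance) is inferred from the single example row above and may differ between STAR-Fusion releases, and `FusionCandidate`, `parse_candidate` and `matches_awk_filter` are hypothetical names, not part of STAR-Fusion.

```python
from dataclasses import dataclass


@dataclass
class FusionCandidate:
    name: str
    junction_reads: int
    spanning_frags: int
    left_dist: int   # assumed to be LeftDistFromRefExonSplice
    right_dist: int  # assumed to be RightDistFromRefExonSplice

    @property
    def total_dist_from_exon(self) -> int:
        # TotalDistFromExon as defined earlier in this post
        return self.left_dist + self.right_dist


def parse_candidate(line: str) -> FusionCandidate:
    # Column positions are an assumption based on the example row only.
    f = line.split()
    return FusionCandidate(f[0], int(f[1]), int(f[2]), int(f[5]), int(f[8]))


def matches_awk_filter(c: FusionCandidate) -> bool:
    # Same condition as: $2 > 1 && $3 > 10 && $6 < 1 && $9 < 1
    return (c.junction_reads > 1 and c.spanning_frags > 10
            and c.left_dist < 1 and c.right_dist < 1)


row = ("RP11-509J21.2--RP11-509J21.1 2 91 "
       "RP11-509J21.2^ENSG00000237359.1 chr9:3647477:+ 0 "
       "RP11-509J21.1^ENSG00000232104.2 chr9:3602523:+ 0")
cand = parse_candidate(row)
print(cand.name, cand.junction_reads, cand.spanning_frags,
      cand.total_dist_from_exon, matches_awk_filter(cand))
```

Header or comment lines would need to be skipped before calling `parse_candidate`, since the sketch converts the count columns to integers unconditionally.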

$ grep RP11-509J21.2--RP11-509J21.1 star_fusion_SL42404_1.fusioncalls.junction_breakpts_to_genes.txt

chr9 3602522 - chr9 3647478 - 2 1 1 HWI-ST1096:321:C2NR2ACXX:7:2314:9579:80954 3602523 36S62M-40p40M3S 3647442 36M64S RP11-509J21.2^ENSG00000237359.1;chr9:3647477:+;0;RP11-509J21.1^ENSG00000232104.2;chr9:3602523:+;0;RP11-509J21.2^ENSG00000237359.1--RP11-509J21.1^ENSG00000232104.2;RP11-509J21.2--RP11-509J21.1

chr9 3602522 - chr9 3647478 - 2 1 1 HWI-ST1096:321:C2NR2ACXX:6:1314:6163:36682 3602523 77S23M 3647344 100M-43p77M23S RP11-509J21.2^ENSG00000237359.1;chr9:3647477:+;0;RP11-509J21.1^ENSG00000232104.2;chr9:3602523:+;0;RP11-509J21.2^ENSG00000237359.1--RP11-509J21.1^ENSG00000232104.2;RP11-509J21.2--RP11-509J21.1

Here are the CIGAR strings from another junction from the same class with 1 junction read, 14 spanning reads:

> star_fusion_SL35353_1.fusioncalls.junction_breakpts_to_genes.txt

...

... 2 0 3 HWI-ST1096:268:D2EUMACXX:3:2112:12015:59467 6973827 15S61M7037N24M 6984214 100M76p15M85S ...

...

Supplemental questions:

-I'm not sure what to make of the SL35353_1 alignments. The beginning and end make sense and don't indicate indels but there is much more alignment than expected. How can there be 7k skipped nt?

-Does the dash indicate an alternative alignment or the read mate? What do the 'p' terms indicate?

-Is there a check internal to star-fusion to make sure that all the reads are not exactly the same and therefore due to an erroneous ligation during library prep?
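On the 7k question, one way to sanity-check such a CIGAR is to add up its operations. In the SAM specification an `N` operation is a skipped region of the reference (typically an intron), not an indel. The sketch below assumes standard SAM CIGAR semantics for the first segment of the SL35353_1 read; `cigar_lengths` is a hypothetical helper name, and it does not handle STAR's `p` mate-gap operation.

```python
import re

# Operations that consume read (query) bases or reference bases,
# following the SAM specification.
QUERY_OPS = set("MIS=X")
REF_OPS = set("MDN=X")


def cigar_lengths(cigar: str) -> tuple[int, int]:
    """Return (read bases accounted for, reference bases spanned)."""
    query = ref = 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(length)
        if op in QUERY_OPS:
            query += n
        if op in REF_OPS:
            ref += n
    return query, ref


read_bases, ref_span = cigar_lengths("15S61M7037N24M")
print(read_bases, ref_span)
```

Under these assumptions the segment still accounts for a 100-base read (15 soft-clipped, 61 and 24 aligned), while the 7037 skipped bases only widen the reference span, which is what a spliced alignment across an intron looks like.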

Thank you in advance; I'm happy to clarify if necessary.

Ben

Screen Shot 2016-04-01 at 1.22.57 PM.png

Screen Shot 2016-04-01 at 1.40.49 PM.png

Screen Shot 2016-04-01 at 1.24.28 PM.png

Brian Haas

Apr 2, 2016, 8:14:08 AM

to Benjamin Kellman, STAR-Fusion

Hi Ben,

I don't think you can discriminate between trans-splicing and non-trans-spliced fusion transcripts based on numbers of spanning frags and breakpoint-overlapping reads. It is peculiar when there seems to be a paucity of breakpoint-overlapping reads as compared to the spanning fragments, but the ratio between the two will depend on a number of factors, including the length of the reads, the distribution of fragment lengths, and the position of the breakpoint relative to the full length of the fusion transcript.
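To make the dependence on read and fragment length concrete, here is a rough geometric sketch. It is a toy model, not anything STAR-Fusion computes: it assumes a single fixed fragment length, fragments placed uniformly across the breakpoint, and a breakpoint far from either transcript end. The function name `count_breakpoint_offsets` and the `min_anchor` parameter are hypothetical.

```python
def count_breakpoint_offsets(fragment_len: int, read_len: int,
                             min_anchor: int = 1) -> tuple[int, int]:
    """Count breakpoint positions within a paired-end fragment.

    d is the number of fragment bases to the left of the breakpoint.
    Returns (junction, spanning): positions where a mate crosses the
    breakpoint with at least min_anchor bases on each side, and
    positions where the breakpoint falls between the two mates.
    """
    junction = spanning = 0
    mate2_start = fragment_len - read_len
    for d in range(1, fragment_len):
        in_mate1 = min_anchor <= d <= read_len - min_anchor
        in_mate2 = min_anchor <= d - mate2_start <= read_len - min_anchor
        if in_mate1 or in_mate2:
            junction += 1
        elif read_len <= d <= mate2_start:
            spanning += 1
    return junction, spanning


for frag in (300, 500):
    j, s = count_breakpoint_offsets(frag, read_len=100)
    print(frag, j, s, round(s / j, 2))
```

In this simplified picture, 100-base reads on fragments of a few hundred bases give spanning and junction counts of the same order of magnitude, so a ratio such as 91 spanning to 2 junction reads is not explained by geometry alone.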

Note, there's a lot of filtering that goes on in STAR-Fusion to remove those entries that are likely to be false positives. The final output file should contain the best candidates, and be sure you're using the latest release of the software, as we've made great strides in improving the specificity of the predictions in recent releases.

If you want to understand the formatting of the various intermediate files generated by STAR, there's some documentation in the STAR pdf file that describes it. Also, STAR-Fusion does remove duplicate read-pair alignments.

best,

~b

--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

Benjamin Kellman

Apr 4, 2016, 2:50:41 PM

to Brian Haas, STAR-Fusion

Dr. Haas,

Thank you for your prompt response. I think I was unclear. I am hoping to discriminate between trans-splicing and erroneous fusion events using the total distance from the exon junction (TotalDistFromExon = RightDistFromRefExonSplice + LeftDistFromRefExonSplice). My thinking is that trans-splicing should occur exactly at the exon junction while erroneous events can occur anywhere.

My confusion stems from the observation that the events occurring with TotalDistFromExon == 0 (potential trans-splicing events) have high SpanningFrags and low Junction Reads. This is the opposite of the pattern usually reported on this forum and in your documentation.

I'm wondering how this could happen. Perhaps it results from an inconsistent splice site, so that these are a few of several nearby junctions. Maybe it is a distribution with a peak or valley at TotalDistFromExon == 0.

I'm wondering if you can offer any insight into:

Why the ratio is flipped when TotalDistFromExon == 0

What TotalDistFromExon == 0 and TotalDistFromExon > 0 mean

What does consistency (observed in most samples) mean in the TotalDistFromExon > 0 events? If these are not errors, what biological mechanism explains splicing with TotalDistFromExon > 0?

Thanks again,

Ben

--

Benjamin P. Kellman

PhD Student

Bioinformatics and Systems Biology

UC, San Diego

Brian Haas

Apr 5, 2016, 9:09:39 AM

to Benjamin Kellman, STAR-Fusion

Hi Ben,

I agree that those fusion predictions showing up with breakpoints that are not at reference splice junctions (or even canonical splice junctions) should be treated with suspicion. The default settings in STAR-Fusion require at least 3 such split reads that agree at the breakpoint position for the fusion to be reported, whereas the default is only 1 breakpoint read if it occurs at reference annotated exons. There are examples of real fusions (from chromosome rearrangements) where the breakpoint is intra-exon, so the breakpoint is not going to be at a reference exon junction. Getting the breakpoint alignments correct (or to at least agree) at non-reference breakpoints is probably harder for the aligners and could account for some of the discrepancies. Note, when you have lots of spanning frags but few breakpoint reads, this is also sometimes an indicator that you're looking at chimeric alignments between paralogs or genes that have shared domains, and these could be real or stem from PCR artifacts. Likely paralogs are auto-filtered as part of STAR-Fusion, but it's not likely to catch them all. If you're looking at the intermediate outputs, you'll see the raw data pre-filtering, and likely plenty of examples of high-span low-breakpoint counts that end up getting filtered out later on as likely artifacts.
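The two default thresholds described above can be written as a small rule. This is only an illustration of the rule as stated in this reply, not STAR-Fusion's actual code: the function names are hypothetical, and treating "at reference annotated exons" as TotalDistFromExon == 0 is an assumption borrowed from the earlier posts in this thread.

```python
def min_junction_reads_required(total_dist_from_exon: int,
                                at_reference: int = 1,
                                off_reference: int = 3) -> int:
    # Assumption: a breakpoint at reference exon boundaries on both sides
    # corresponds to a total distance of zero.
    return at_reference if total_dist_from_exon == 0 else off_reference


def has_enough_junction_support(junction_reads: int,
                                total_dist_from_exon: int) -> bool:
    return junction_reads >= min_junction_reads_required(total_dist_from_exon)


# The RP11-509J21.2--RP11-509J21.1 example: 2 junction reads, distance 0.
print(has_enough_junction_support(2, 0))
# The same read support at a hypothetical off-reference breakpoint.
print(has_enough_junction_support(2, 5))
```

Under this reading, a candidate with only 1 to 3 junction reads can still be reported when its breakpoint sits on reference exon boundaries, which fits the "few but never zero" junction counts seen in group 1.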

best,

~b

Brian Haas

Apr 5, 2016, 9:11:57 AM

to Benjamin Kellman, STAR-Fusion

another comment - I don't think you're going to find a way to discriminate between trans-splicing and 'genuine' fusion events from translocations without supplementing your RNA-Seq data with DNA-Seq data so you can find DNA-level evidence of chromosomal rearrangements.

~b

Benjamin Kellman

Apr 7, 2016, 3:04:59 AM

to Brian Haas, STAR-Fusion

Thank you very much, this was a very helpful answer. Best of luck.

Ben


Brian Haas

Apr 7, 2016, 7:22:22 AM

to Benjamin Kellman, STAR-Fusion

Another thought - if the breakpoint involves low entropy flanking sequence (simple repeats) then the number of breakpoint reads will be underestimated due to active filtering or difficulty in achieving proper alignment.
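One simple way to look for low-entropy flanking sequence is the Shannon entropy of its base composition. This is a minimal sketch under that assumption; the measure and any cutoff STAR-Fusion actually uses may differ, and `shannon_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per base) of the base composition of seq."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    total = len(seq)
    counts = Counter(seq)
    # Sum of -p * log2(p) over the observed bases.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


for flank in ("AAAAAAAAAAAA", "ACACACACACAC", "ACGTACGTACGT"):
    print(flank, shannon_entropy(flank))
```

A homopolymer scores 0 bits and an even mix of all four bases scores 2 bits. Composition-based entropy ignores base order, so a periodic sequence with balanced composition still scores high; a k-mer based measure would be needed to catch those repeats.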

Most of the time it involves paralogs or shared sequence domains, from what I've encountered.

Best,

-Brian

(by iPhone)

Benjamin Kellman

Apr 7, 2016, 8:26:13 PM

to Brian Haas, STAR-Fusion

Ah, yes that makes sense too.

So that brings me back to one of my original questions regarding your CIGAR strings:

3602523 36S62M-40p40M3S 3647442 36M64S

3602523 77S23M 3647344 100M-43p77M23S

(assuming the '-' indicates two alternative alignments). It seems like the second example shows a complementary alignment over a junction (77S23M/77M23S) but also shows a perfect match (100M) for the second position. Would this be an example of a potential paralogy error, where one read could align perfectly to a gene but, because of low entropy, paralogy or a shared domain, there is partial homology with another gene, resulting in an apparent junction?

Thanks again,

Ben

Brian Haas

Apr 7, 2016, 9:19:24 PM

to Benjamin Kellman, STAR-Fusion

This is just how Alex Dobin configured STAR to report the paired fragment alignments. The cigar string with the 'p' is a paired-end cigar string (as Alex defines). It's basically read1-p-read2 where the p-number indicates the distance between the two alignments. You'll generally see that the part of one of the reads that's soft-clipped ends up aligning in the other separate cigar string.

From page-11 of the STAR manual (pdf):

column 14: CIGAR of the second segment
Unlike standard SAM, both mates are recorded in one line here. The gap of length L between the
mates is marked by the p in the CIGAR string. If the mates overlap, L<0.
For strand definitions, when aligning paired end reads, the sequence of the second mate is reverse
complemented.
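Following that description, a paired-end CIGAR can be split at the `p` operation into the two mates plus the gap between them. The sketch below assumes the format is exactly as quoted from the manual (a lowercase `p` preceded by a possibly negative gap length); `split_paired_cigar` is a hypothetical helper, not part of STAR or STAR-Fusion.

```python
import re

TOKEN = re.compile(r"(-?\d+)([MIDNSHP=Xp])")
WHOLE = re.compile(r"(?:-?\d+[MIDNSHP=Xp])+")


def split_paired_cigar(cigar: str):
    """Return (mates, gap): a list of per-mate (length, op) lists and the
    mate gap L, or None when the string holds a single segment."""
    if not WHOLE.fullmatch(cigar):
        raise ValueError(f"unrecognised CIGAR: {cigar!r}")
    mates = [[]]
    gap = None
    for length, op in TOKEN.findall(cigar):
        if op == "p":
            # Lowercase p marks the gap between mates; negative means overlap.
            gap = int(length)
            mates.append([])
        else:
            mates[-1].append((int(length), op))
    return mates, gap


for cigar in ("36S62M-40p40M3S", "100M-43p77M23S", "36M64S"):
    print(cigar, split_paired_cigar(cigar))
```

Read this way, the '-' in `62M-40p` is the sign of the gap (the mates overlap by 40 bases), and in the second example the `77M23S` mate mirrors the separate `77S23M` segment, so the soft-clipped part of one read is what aligns on the other side of the breakpoint.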