Duplicate events when extracting the events?


qk Wang

May 29, 2025, 1:22:34 AM
to Biociphers
Hello, I would like to ask whether duplicate events are expected when extracting events.
In the "cassette.tsv" file, C1_C2 and C2_C1 are duplicated. In the "alt3and5prime.tsv" file, "E1_E2_J1" and "E2_E1_J1" are also duplicated. Is that correct? The most important point for me: do C1_A and C2_A in the "cassette.tsv" file both represent inclusion of the middle exon, that is, the connection from C1 through A to C2? If so, do I only need to extract C1_C2 and C1_A to capture the events in "cassette"? The main criterion should be the "junction_coord", right? And the s and t in the "lsv_id" do not affect the coordinates, right? Thank you for your reply.
Screenshot 2025-05-29 131040.png

Caleb Radens

May 29, 2025, 2:45:33 PM
to Biociphers
Hi, you're correct that C1_C2 and C2_C1 both refer to the exact same junction coordinate (exon skipping). However, MAJIQ quantifies this junction twice: once from the point of view of exon 4 and once from the point of view of exon 6. The numerator is the same in both cases (MAJIQ uses the same split reads when quantifying this junction), but the denominator can, and usually does, differ between exon 4's and exon 6's points of view.

Exon 4 is the reference exon for a source LSV (hence the "s" in the LSV ID), and MAJIQ quantifies splicing for the LSV from exon 4's point of view: which junctions start from exon 4? C1_C2 and C1_A. 

Exon 6 is the reference exon for a target LSV (hence the "t" in the LSV ID), and MAJIQ quantifies splicing for the LSV from exon 6's point of view: which junctions end in exon 6? C1_C2 and C2_A.
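
To make the source/target distinction concrete, here is a minimal sketch that splits an LSV ID into its parts. It assumes the ID has the form `gene_id:s:start-end` or `gene_id:t:start-end`; that layout, the `LsvId` type, and the `parse_lsv_id` helper are illustrative assumptions, not part of MAJIQ's API, so check them against your own voila output.

```python
from typing import NamedTuple


class LsvId(NamedTuple):
    gene_id: str
    lsv_type: str  # "s" = source LSV, "t" = target LSV
    ref_exon: str  # reference exon coordinates, kept as text


def parse_lsv_id(lsv_id: str) -> LsvId:
    """Split an LSV ID into gene, type, and reference exon.

    Assumes the layout gene_id:type:exon_coords (hypothetical helper).
    rsplit is used so that gene IDs containing ':' stay intact.
    """
    gene_id, lsv_type, ref_exon = lsv_id.rsplit(":", 2)
    if lsv_type not in ("s", "t"):
        raise ValueError(f"unexpected LSV type {lsv_type!r} in {lsv_id!r}")
    return LsvId(gene_id, lsv_type, ref_exon)


# Made-up IDs: the same skipping junction reported by two different LSVs.
source = parse_lsv_id("GENE1:s:100-200")  # reference exon = upstream exon
target = parse_lsv_id("GENE1:t:500-600")  # reference exon = downstream exon
```

The `s`/`t` field only says which exon is the reference; the junction itself is identified by its own coordinate column, which is why the same junction can appear under both LSVs.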

In a perfect world, with perfect RNA-Seq read coverage, a simple cassette exon would have equal numbers of reads for C1_A and C2_A. However, technical (and biological) variation often causes differences in read coverage between C1_A and C2_A (in the cartoon, for example, you see 10 vs. 17 reads). Also, if the module is complex, there may be other junctions that splice from exon 4 or into exon 6, which would cause further differences between the source and target LSV quantifications of the exon skipping junction.
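
As a rough illustration of the shared numerator and differing denominators, the sketch below computes a naive read-ratio PSI for the skipping junction from each LSV. This is not MAJIQ's actual statistical model, only simple arithmetic. The 10 and 17 inclusion reads follow the cartoon (which count belongs to C1_A and which to C2_A is assumed here), and the 20 skipping reads are a made-up number.

```python
from typing import Mapping


def naive_psi(junction_reads: Mapping[str, int], junction: str) -> float:
    """Fraction of an LSV's reads that support one junction (naive ratio)."""
    total = sum(junction_reads.values())
    if total == 0:
        raise ValueError("no reads in this LSV")
    return junction_reads[junction] / total


skip_reads = 20  # hypothetical count for the C1_C2 (exon skipping) junction

# Source LSV: junctions that start from exon 4 (C1).
source_lsv = {"C1_C2": skip_reads, "C1_A": 10}
# Target LSV: junctions that end in exon 6 (C2).
target_lsv = {"C1_C2": skip_reads, "C2_A": 17}

psi_source = naive_psi(source_lsv, "C1_C2")  # 20 / (20 + 10)
psi_target = naive_psi(target_lsv, "C1_C2")  # 20 / (20 + 17)
print(f"source PSI = {psi_source:.3f}, target PSI = {psi_target:.3f}")
```

The same 20 reads enter both ratios, but the two values differ because each LSV divides by a different set of junctions.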

-Caleb Radens

qk Wang

Jun 11, 2025, 8:44:31 AM
to Biociphers
Thank you for your reply. I have a few more questions. I used a .gff3 file as input to majiq build. Are the junction_coord fields in voila outputs 0-based half-open, or already 1-based like .gff3?
Should I apply start + 1 when extracting sequences using bedtools or getSeq()?  
-qk Wang

Matthew Gazzara

Jun 11, 2025, 9:07:55 AM
to Biociphers
Hi,

Yes, the coordinates are consistent with the gff3 input and use 1-based starts.

The junction coordinates correspond to the first and last nucleotide of the exons or exon regions they splice together. This means that if you are interested in extracting only the intronic sequence, you should add one to the junction start and subtract one from the junction end. If the junction coordinate refers to an intron retention, then the coordinates correspond to the intron exactly and are still 1-based.
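
The coordinate arithmetic described above can be sketched as follows. The helper names are hypothetical, and the sketch assumes junction_coord is a `start-end` string. The BED conversion relies on BED's standard 0-based, half-open convention, which is what bedtools getfasta expects; a 1-based inclusive range can be passed to Bioconductor's getSeq() via a GRanges object without further shifting.

```python
from typing import Tuple


def junction_to_intron(junction_coord: str,
                       is_intron_retention: bool = False) -> Tuple[int, int]:
    """Return the intron as a 1-based, inclusive (start, end) pair.

    Assumes junction_coord looks like "start-end" (hypothetical helper).
    For a regular junction the coordinates are the flanking exon
    nucleotides, so the intron lies strictly between them. For an intron
    retention the coordinates already match the intron.
    """
    start_text, end_text = junction_coord.split("-")
    start, end = int(start_text), int(end_text)
    if not is_intron_retention:
        start, end = start + 1, end - 1
    return start, end


def to_bed_interval(start_1based: int, end_1based: int) -> Tuple[int, int]:
    """Convert a 1-based inclusive range to BED (0-based, half-open)."""
    return start_1based - 1, end_1based


# Made-up junction: last exon base at 100, first base of the next exon at 200.
intron = junction_to_intron("100-200")  # 1-based inclusive intron
bed_start, bed_end = to_bed_interval(*intron)
print(intron, (bed_start, bed_end))
```

Under these assumptions, a regular junction's intron in BED form works out to the junction start unchanged as chromStart and the junction end minus one as chromEnd.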

Let me know if you have more questions or if anything is unclear. 

-Matt Gazzara