Calculation of inclusion level for exon skipping events

75 views
Skip to first unread message

Marion Nyaboke

unread,
Oct 17, 2024, 9:31:49 AM10/17/24
to rMATS User Group

Hi,

I am working on PSI calculations for alternative splicing, particularly exon skipping events. I’m uncertain about the treatment of junction reads supporting the inclusion isoform.

For instance, when calculating PSI for an exon inclusion event, reads spanning both upstream and downstream junctions are often used. If we sum these reads, is there a risk of double-counting the inclusion isoform, given that both junctions essentially represent the same splicing event? Would this approach lead to an overrepresentation of the inclusion reads in the PSI calculation?

Alternatively, would averaging the reads from the two junctions better reflect the actual inclusion level, or is there a specific reason why summing is considered the correct approach? I could not find an answer in the documentation or the source code.

Thanks,

Marion

kutsc...@gmail.com

unread,
Oct 18, 2024, 2:31:38 PM10/18/24
to rMATS User Group
In rMATS v4.3.0 a new parameter (--individual-counts) is added which adds output columns showing the counts for the different junctions in the event. Here's an example:
ID GeneID geneSymbol chr strand exonStart_0base exonEnd upstreamES upstreamEE downstreamES downstreamEE ID IJC_SAMPLE_1 SJC_SAMPLE_1 IJC_SAMPLE_2 SJC_SAMPLE_2 IncFormLen SkipFormLen PValue FDR IncLevel1 IncLevel2 IncLevelDifference upstream_to_target_count target_to_downstream_count target_count upstream_to_downstream_count
441 "ENSG00000166049.11" "PASD1" chrX + 151611663 151611753 151604645 151604734 151620929 151621029 441 54,48,51 0,0,2 301,180,223 23,13,12 239 149 1.61648459519e-05 0.001265509501144563 1.0,1.0,0.941 0.891,0.896,0.921 0.078 36,33,34,230,122,147 32,25,34,151,103,127 0,0,0,0,0,0 0,0,2,23,13,12

The PSI (IncLevel) is calculated from the counts:
IJC_SAMPLE_1 54,48,51
SJC_SAMPLE_1 0,0,2
IJC_SAMPLE_2 301,180,223
SJC_SAMPLE_2 23,13,12

And normalized by the effective isoform lengths:
IncFormLen 239
SkipFormLen 149

For example, the first sample in group 2 has PSI = (301/239)/((301/239) + (23/149)) = 0.8908
IncLevel1 1.0,1.0,0.941
IncLevel2 0.891,0.896,0.921

From the --individual-counts columns you can determine how many reads covered both inclusion junctions. The sum of upstream and downstream would be 230 + 151 = 381, but the actual count used in the PSI calculation is 301 so there are 80 reads for that sample that cover both junctions (upstream_to_target=150, target_to_downstream=71, both=80):
upstream_to_target_count 36,33,34,230,122,147
target_to_downstream_count 32,25,34,151,103,127
target_count 0,0,0,0,0,0
upstream_to_downstream_count 0,0,2,23,13,12

Each read is counted at most once in the PSI calculation. The effective length normalization accounts for the inclusion isoform having two junctions. The IncFormLen is higher than SkipFormLen which reflects the two inclusion junctions versus the single skipping junction. The effective length calculations are shown in the README: https://github.com/Xinglab/rmats-turbo/tree/v4.3.0?tab=readme-ov-file#output

Eric

X L

unread,
Oct 18, 2024, 3:58:30 PM10/18/24
to rMATS User Group
Hi, Eric,

In your example, what is the "upstream_to_downstream_count"?

Xiao

kutsc...@gmail.com

unread,
Oct 21, 2024, 8:39:02 AM10/21/24
to rMATS User Group
The upstream_to_downstream_count is the count of reads supporting the skipping junction of the skipped exon event. Those reads cover the junction between the upstream and downstream exons

Eric

Marion Nyaboke

unread,
Oct 23, 2024, 11:06:41 AM10/23/24
to kutsc...@gmail.com, rMATS User Group
So if I understand correctly, the PSI calculation ensures that reads spanning multiple junctions are not double-counted but assumes that each read in upstream_to_target and the target_to_downstream (not shared) comes from a completely distinct isoform and is counted as such? 

kutsc...@gmail.com

unread,
Oct 23, 2024, 3:20:38 PM10/23/24
to rMATS User Group
Yes, the PSI calculation ensures that reads spanning multiple junctions are not double counted. The IJC and SJC columns are used for the PSI calculation. The other columns like upstream_to_target_count are optional and just provide extra information

Here is the code for counting reads toward SE events: https://github.com/Xinglab/rmats-turbo/blob/v4.3.0/rMATS_pipeline/rmatspipeline/rmatspipeline.pyx#L1951
A read can be counted in at most one of those 'if' statements. Each read can only be counted once for either IJC or SJC, but it could be counted as both upstream_to_target_count and target_to_downstream_count if it covers both junctions

Eric

Marion Nyaboke

unread,
Oct 24, 2024, 5:12:58 AM10/24/24
to kutsc...@gmail.com, rMATS User Group
Got it. Thank you, Eric.

--
You received this message because you are subscribed to the Google Groups "rMATS User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rmats-user-gro...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/rmats-user-group/265145f8-9624-424a-aacb-97e28ea2a638n%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages