How to use MOCCASIN API for polyadenylation events?

76 views
Skip to first unread message

Hsuan-lin Her

unread,
Jun 25, 2024, 7:13:03 PM6/25/24
to Biociphers
Hi MAJIQ development team,

I am Charlene, a graduate student from Yeo lab, UCSD. We are trying to study RNA-binding proteins but our RNA-seq has rampant batch effect. I came across MOCCASIN and was very impressed. I was wondering if I can repurpose the algorithm to correct batch effects in terms of quantifying polyA site usage.

PolyA site usage is also a ratio, just like the PSI in splicing. Specifically, the PSI of a polyA site quantifies the fraction of read supporting a particular polyA site out of all possible polyA sites in a gene.

I wonder if the following way is logical, following the example here for the MOCCASIN API:

```
#line 72
output1 = model.model_with_bootstraps(grouping_data, coverage_data)
```
I am most confused about what is grouping_data and coverage_data. Here are my guesses:

1. grouping_data: I will to specify all the alternative polyA sites in each gene/transcript in `grouping_data`, specifying the indices which row a new gene begins.
2. coverage_data: I will specify the normalized read counts (such as TPM?) supporting each polyA site per batch. Each matrix is (number of polyA site *. number of samples)? Please advise what is the the appropriate normalizing method.

Any feedback will be tremendously helpful!
Thank you again for making such awesome software and suites for splicing!

Charlene


bsl...@seas.upenn.edu

unread,
Jul 3, 2024, 3:06:36 PM7/3/24
to Biociphers

Hi Charlene,

It is possible to adapt Moccasin to other situations for confounding-factor adjustment with ratios, as for your Poly-A case. My disclaimer is that, while the code and math should work, I haven’t evaluated how it will perform for Poly-A ratio adjustments.

As to your specific questions:

For MAJIQ/LSVs, coverage_data is a matrix per input sample. Each row is a splice junction and each column is a bootstrap replicate index (default 30 for MAJIQ). The row i, column j entry is the read rate calculated for the i’th junction and j’th bootstrap. It is not normalized; Moccasin does normalization as part of its modeling, and then returns adjusted read rates. (That said, the calculation of read rates itself goes through some quality control steps upstream.) The MAJIQ Quantifier considers variance in the bootstrap read rates (i.e., across columns) to model the posterior PSI distribution. It’s possible your scenario would not use bootstraps, in which case you would have just one column. You could then convert the group values to ratios (normalizing within groups) to get the final Poly-A ratios per gene. Alternatively, if you had bootstrap replicates, you could consider variance across the bootstrap values per Poly-A site, analogous to what the MAJIQ Quantifier does for PSI calculations.

For MAJIQ/LSVs, grouping_data is an array of LSV (group) start indexes. For example if this starts [0, 3, 5…] it means coverage rows 0, 1, 2 are the junctions of the first LSV; 3, 4 are the junctions of the second LSV; etc… The order is assumed to be shared across the coverage_data matrices for all input samples. So, yes, I think this is what you wrote for grouping_data, with genes instead of LSVs.

Best Regards,

Barry

Hsuan-lin Her

unread,
Jul 3, 2024, 6:45:20 PM7/3/24
to Biociphers, bsl...@seas.upenn.edu, Pratibha Jagannatha
Hi Barry!

Thank you so much for this detailed response!!! 
This is extremely helpful! I am very grateful for your help!

I will take a look and try with what you suggested, perhaps bootstrap a little bit.



Charlene
--
You received this message because you are subscribed to the Google Groups "Biociphers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to majiq_voila...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/majiq_voila/796e375b-dc76-453c-82d5-8e39bae57594n%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages