multiple testing correction?


Welmoed van Zuiden

Nov 3, 2025, 4:01:08 PM
to Biociphers
Hi,

Are the 'probability changing' values in deltaPSI and the TNOM (and TTEST and WILCOXON) values in heterogen analysis adjusted for multiple testing? If not, what is the reason and what is the recommended method to adjust for multiple hypotheses, since MAJIQ is always transcriptome-wide?

Thank you

yos...@biociphers.org

Nov 17, 2025, 4:00:19 PM
to Biociphers

Hi,

The statistical tests used in MAJIQ HET are *not* adjusted for multiple hypothesis testing. The reason is simple: what HET reports is a set of events that pass two filters - the p-value filter you are asking about, but also a threshold on the difference between the medians of the expected PSI of samples from each group. Thus, if you want to do multiple hypothesis correction you need a null model for that combined test. We are not aware of an appropriate null distribution for this that is easy to compute numerically, and such a null would likely need to be adjusted to the specific characteristics of the specific dataset. Remember that HET is meant to deal with "heterogeneous" data (i.e. not "clean" biological/technical replicates), and as you can imagine the sources of this heterogeneity can have major effects on what a good null would look like.

Another point relates to the motivation for doing such multiple hypothesis correction in the first place. Typically, you want a good null model with multiple hypothesis correction in order to control your false positives (e.g. as in FDR). What we see in practice is that the combination of the two filters discussed above controls FDR very well. For evidence, see our Vaquero et al. Nat Comm 2023 paper (where we used GTEx donor samples in several tissues), where our approach controls FDR at least as well as other methods (actually much better in some cases). Our practical experience in many other works with our own and collaborators' data also supports this approach, but of course YMMV.

As a side note (in case it helps your specific case), we discussed in that paper and in other threads on this forum that for a very low number of samples (e.g. n=3) users can use TNOM score = 0 (i.e. perfect separation) together with the dPSI value as a good way to control for false positives, simply because p-values for rank tests such as TNOM cannot get low with so few samples. A rough filtering sketch is given below.
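To make the two-filter idea concrete, here is a minimal sketch of how one might post-filter a HET results table. The column names (tnom_pval, tnom_score, median_dpsi) are placeholders for illustration only, not MAJIQ's actual output headers, so adapt them to whatever your TSV contains.

```python
import pandas as pd

# Hypothetical column names -- adjust to the headers in your actual HET output TSV.
PVAL_COL = "tnom_pval"     # TNOM p-value (could also be TTEST / WILCOXON)
SCORE_COL = "tnom_score"   # TNOM score; 0 means perfect separation of the two groups
DPSI_COL = "median_dpsi"   # difference between group medians of expected PSI

def filter_het(df: pd.DataFrame, pval_thresh: float = 0.05,
               dpsi_thresh: float = 0.20, small_n: bool = False) -> pd.DataFrame:
    """Apply HET's two filters: a test p-value cutoff plus a |dPSI| cutoff.

    For very small groups (e.g. n=3), rank-test p-values cannot get low,
    so instead require perfect separation (TNOM score == 0) plus the dPSI cutoff.
    """
    big_change = df[DPSI_COL].abs() >= dpsi_thresh
    if small_n:
        return df[(df[SCORE_COL] == 0) & big_change]
    return df[(df[PVAL_COL] <= pval_thresh) & big_change]

# Usage: events = filter_het(pd.read_csv("het_results.tsv", sep="\t"), small_n=True)
```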


Hope this helps clarify the reasons why multiple hypothesis testing is not applied in HET and why it should still be a robust/efficient method to identify splicing changes.


Y.

Welmoed van Zuiden

Nov 19, 2025, 5:47:35 AM
to Biociphers
Hi,

Thanks for the explanation! Does the same go for the 'probability changing' statistic in the deltaPSI approach, since the same two filters are applied there?

Best,
Welmoed

yos...@biociphers.org

Nov 25, 2025, 11:11:07 PM
to Biociphers
Hi Welmoed,

Multiple hypothesis correction is in general a frequentist concept rather than a Bayesian one: you have a null you are testing against (that's where the p-value comes from, i.e. the chance of getting a result "at least as good" "by chance"), and since you repeat this test many times you need to correct for the multiple hypotheses tested. MAJIQ's deltaPSI, however, is a Bayesian model, and in Bayesian statistics there is no direct multiple hypothesis correction. Instead, much of the protection against multiple-testing issues comes from the prior (there are other elements in Bayesian statistics that can help, like hierarchical priors, but these are not relevant here). MAJIQ dPSI has a prior which strongly favors no or small changes (this is explained in the eLife 2016 paper). To overcome this prior, an LSV must have enough read evidence, so the prior protects the model from small fluctuations that can happen due to "noise" or "by chance".

Moreover, the Bayesian model allows us to compute a posterior probability that the change (dPSI) is at least X% (default: 20%) with Y% confidence (default: 95%). So, if the model "works well", then of all the times it gave you such a statement (i.e. the list of LSVs that passed those thresholds) it should be right at least 95% of the time, i.e. your FDR should be controlled at 0.05. In practice, if you look at the analysis performed in the V2 Nat Comm 2023 paper for the dPSI model, you see it too controls FDR very well - it's actually more conservative than 0.05 FDR in our analysis.

Of course this is not bulletproof: if the data violates the model's assumptions then your estimates might be off, and your data might be somehow very different from, say, the GTEx data we used to simulate "ground truth" and test in V2, where we "knew" the underlying truth values. So how would you know *for your data*? That's a tough challenge because, last time I checked, no oracle or crystal ball is on sale on Amazon for this, unfortunately... One way to get a sense of it is to repeat the same analysis with similar samples and see if you get the same answers - these are the RR curves and RR values in the V2 paper where, again, MAJIQ compares very favorably to other tools. Another way is to count how many events are reported when the two groups are *not* supposed to differ (e.g. compare 3 brain samples to another set of 3 brain samples) - the ratio between the number you get for groups that are not expected to have changes (brain vs brain) and the number you get for groups that are (brain vs liver) is the IIR statistic suggested in the V2 paper as a proxy for FDR.

Still, as we say in the paper (and as I state repeatedly in talks), you should always assess things like RR and IIR on your data to see how the algorithm (MAJIQ or others) behaves. But keep in mind that RR and IIR are only proxies for what you are really after (FDR), not the real thing, so you have to be careful in how you interpret them. For example, it is possible that between a given set of 3 brain and 3 liver samples, LSV X is differentially spliced (and MAJIQ was correct in flagging it as such), but for whatever reason, when you repeat that analysis with a different set of 3 brain and 3 liver samples, LSV X really isn't differentially spliced, or maybe it just didn't get enough reads to overcome the prior (see above). Such a situation will hurt your RR, but not because the algorithm "was wrong".
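As a rough illustration of the two ideas above (the posterior "probability changing" filter and the IIR proxy for FDR), here is a minimal sketch. It assumes you can draw posterior samples of dPSI for an LSV and that you have counted reported events from a "signal" and a "null" comparison; the function names and inputs are illustrative, not MAJIQ's actual API.

```python
import numpy as np

def prob_changing(dpsi_posterior_samples: np.ndarray, threshold: float = 0.20) -> float:
    """Estimate P(|dPSI| >= threshold) from posterior samples of dPSI.
    An LSV "passes" when this probability reaches the confidence level
    (95% by default, as described above)."""
    return float(np.mean(np.abs(dpsi_posterior_samples) >= threshold))

def iir(n_null_events: int, n_signal_events: int) -> float:
    """Intra-to-inter ratio: number of events reported when no change is
    expected (e.g. brain vs brain) divided by the number reported when a
    change is expected (e.g. brain vs liver); a rough proxy for FDR."""
    return float("nan") if n_signal_events == 0 else n_null_events / n_signal_events

# Usage (toy numbers): iir(n_null_events=12, n_signal_events=400)  # -> 0.03
```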
Another issue with RR analysis which I like to highlight is that it can give you inflated performance due to a strong bias that is highly reproducible. Example: a simple "algorithm" that ranks all gene splicing changes by gene name, alphabetically. Clearly garbage - but with perfect reproducibility across datasets.
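For completeness, one common way to compute a reproducibility ratio of the kind described above is sketched below: rank events by your statistic of choice in two independent replicate comparisons and ask how many of the top-n events in one also appear among the top-n of the other. This is a generic sketch, not MAJIQ's exact RR implementation.

```python
from typing import List

def rr_at_n(ranked_a: List[str], ranked_b: List[str], n: int) -> float:
    """Reproducibility ratio at n: fraction of the top-n events (by rank)
    in comparison A that also appear among the top-n events in comparison B."""
    top_a, top_b = ranked_a[:n], set(ranked_b[:n])
    if not top_a:
        return float("nan")
    return sum(event in top_b for event in top_a) / len(top_a)

# Usage (hypothetical event IDs ranked by, e.g., posterior P(|dPSI| >= 0.2)):
# rr_at_n(["lsv_3", "lsv_1", "lsv_9"], ["lsv_1", "lsv_3", "lsv_5"], n=2)  # -> 1.0
```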

Bottom line: dPSI is a Bayesian model, so multiple hypothesis correction is in a sense "baked into the prior", and in our tests it works quite well. The dPSI model assumes the samples in each group are biological replicates that share the underlying PSI (and consequently the dPSI between sample groups). Since many datasets may be large but violate this assumption, being more heterogeneous in their underlying PSI or dPSI, we developed the HET analysis. In the V2 paper we assessed these models by a variety of metrics on real GTEx data, "realistic" synthetic data based on GTEx, and independent RT-PCR validations, and showed they work well compared to other algorithms, controlling false positives while giving high RR and power. We also recommended testing things like RR on *your* data and using the more robust HET statistics when you worry that the data is noisier. In our experience this works well, but of course it's possible your specific data is somehow different, so YMMV and you should always assess things yourself, e.g. by RR and IIR analysis or a set of RT-PCR validations - again, as done in the V2 paper.

This ended up being a really long answer, but I hope it helps clarify how false positives are controlled.

Y.
