Hi Welmoed,
Multiple hypothesis correction is typical of frequentist approaches, where you have a null hypothesis you test against (that's where the p-value comes from, i.e. the chance of getting a result "at least as extreme" by chance under the null). Since you repeat such a test many times, you need to correct for the multiple tests performed. However, MAJIQ's deltaPSI is a Bayesian model, and in Bayesian statistics there is no direct multiple hypothesis correction. Instead, much of the protection against multiple-testing issues comes from the prior (I'll note there are other elements in Bayesian statistics that can help, like hierarchical priors, but these are not relevant here). Specifically, MAJIQ dPSI uses a prior that strongly favors no/small changes (this is explained in the eLife 2016 paper). To overcome this prior, an LSV must have enough read evidence, so the prior protects the model from small fluctuations that can happen due to "noise" or "by chance".

Moreover, the Bayesian model lets us compute the posterior probability that the change (dPSI) is at least X% (default: 20%), and we report LSVs where that probability is at least Y% (default: 95%). So, if the model is well calibrated, then across all the times it gave you such a statement (i.e. the list of LSVs that passed those thresholds) it should be right at least 95% of the time, i.e. your FDR should be controlled at 0.05 (I sketch the arithmetic for this below). In practice, if you look at the analysis performed in the V2 Nat Comm 2023 paper, the dPSI model does control the FDR very well - it is actually more conservative than 0.05 in our analysis. Of course this is not bulletproof: if your data violates the model's assumptions then the estimates might be off, and your data might be very different from, say, the GTEx data we used to simulate "ground truth" in V2, where we "knew" the underlying true values.

So, how would you know *for your data*? That's a tough challenge because, last time I checked, no oracles or crystal balls are on sale on Amazon for this, unfortunately... One way to get a sense of it is to repeat the same analysis with similar samples and see if you get the same answers - these are the reproducibility ratio (RR) curves and values in the V2 paper, where, again, MAJIQ compares very favorably to other tools. Another way is to count how many events are reported when the two groups are *not* supposed to differ (e.g. compare 3 brain samples to another set of 3 brain samples) - the ratio between the number you get for groups that are not expected to differ (brain vs brain) and the number you get for groups that are (brain vs liver) is the IIR statistic suggested in the V2 paper as a proxy for FDR. Still, as we say in the paper (and as I state repeatedly in talks), you should always assess things like RR and IIR on your own data to see how the algorithm (MAJIQ or any other) behaves. But keep in mind that RR and IIR are only proxies for what you are really after (FDR), so you have to be careful in how you interpret them. For example, it is possible that between a given set of, say, 3 brain and 3 liver tissue samples, LSV X is differentially spliced (and MAJIQ was correct to flag it), but when you repeat the analysis with a different set of 3 brain and 3 liver samples, LSV X really isn't differentially spliced there, or it simply doesn't get enough reads to overcome the prior (see above). In such a situation your RR takes a hit, but not because the algorithm "was wrong".
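To make the FDR point above a bit more concrete, here is a minimal sketch (plain Python/NumPy, not MAJIQ code, and the probabilities are made up) of why a 95% posterior confidence threshold acts like FDR control at 0.05 when the posteriors are well calibrated:

```python
import numpy as np

# Hypothetical posterior probabilities P(|dPSI| >= 0.20) for a set of LSVs
# (values here are made up, just for illustration).
posterior_prob = np.array([0.99, 0.97, 0.96, 0.95, 0.80, 0.40, 0.05])

confidence = 0.95
reported = posterior_prob >= confidence  # the LSVs you would report

# If the posteriors are well calibrated, 1 - P is the chance that each
# reported event is actually a false positive, so the expected FDR of the
# reported list is the mean of (1 - P) over reported events, which is
# at most 1 - 0.95 = 0.05.
expected_fdr = np.mean(1.0 - posterior_prob[reported])
print(f"reported {reported.sum()} LSVs, expected FDR ~ {expected_fdr:.3f}")
```

That is the sense in which the threshold controls FDR; it is not a frequentist guarantee, which is exactly why checking things like RR and IIR on your own data is still worthwhile.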
Another issue with RR analysis that I like to highlight is that it can give you inflated performance due to a strong bias that is highly reproducible. Example: my simple "algorithm" ranks all gene splicing changes by gene name, alphabetically. Clearly garbage - but with perfect reproducibility across datasets...
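If you want to compute RR and IIR on your own comparisons, the bookkeeping itself is simple. A minimal sketch, assuming you have already extracted the sets of LSV IDs that pass your dPSI/confidence thresholds from each run (the IDs below are placeholders, and note the RR in the paper is computed over ranked event lists, so this set-overlap version only gives the flavor):

```python
# Hypothetical sets of LSV IDs passing your thresholds in each comparison;
# in practice you would parse these out of the MAJIQ/VOILA output.
brain_vs_liver_run1 = {"LSV_A", "LSV_B", "LSV_C", "LSV_D"}  # signal, sample set 1
brain_vs_liver_run2 = {"LSV_A", "LSV_B", "LSV_E"}           # signal, sample set 2
brain_vs_brain = {"LSV_F"}                                  # no change expected

# Reproducibility: of the events reported with sample set 1, how many are
# also reported when you repeat the analysis with a different sample set?
rr = len(brain_vs_liver_run1 & brain_vs_liver_run2) / len(brain_vs_liver_run1)

# IIR: events reported where no changes are expected, relative to events
# reported where changes are expected - a rough proxy for FDR.
iir = len(brain_vs_brain) / len(brain_vs_liver_run1)

print(f"RR = {rr:.2f}, IIR = {iir:.2f}")
```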
Bottom line: dPSI is a Bayesian model, so multiple hypothesis correction is in a sense "baked into the prior", and in our tests this works quite well. The dPSI model assumes the samples in each group are biological replicates that share the underlying PSI (and consequently the dPSI between sample groups). Since many datasets may be large but violate this assumption, i.e. be more heterogeneous in their underlying PSI or dPSI, we developed the HET analysis. In the V2 paper we assessed these models with a variety of metrics on real GTEx data, "realistic" synthetic data based on GTEx, and independent RT-PCR validations, and showed they work well compared to other algorithms, controlling false positives while giving high RR and power. We also recommended testing things like RR on "your" data and using the more robust HET statistics when you are worried your data is noisier. In our experience this works well, but of course it's possible your specific data is somehow different, so YMMV and you should always assess it yourself, e.g. with RR and IIR analysis or a set of RT-PCR validations - again, as done in the V2 paper.
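To illustrate the replicate assumption in the paragraph above, here is a toy example (made-up numbers, not MAJIQ code) of the kind of within-group heterogeneity HET is meant to handle:

```python
import numpy as np

# Per-sample PSI estimates for one junction in two groups (made-up numbers).
group_a = np.array([0.20, 0.50, 0.80])  # heterogeneous: samples disagree
group_b = np.array([0.48, 0.50, 0.52])  # well-behaved biological replicates

# dPSI assumes each group has one shared underlying PSI. For group_b that is
# reasonable; for group_a the group-level value (~0.5) does not describe any
# individual sample well, so treating the samples as replicates of a single
# PSI throws away the structure in the data.
print("group_a mean PSI:", group_a.mean(), "per-sample spread:", group_a.std())
print("group_b mean PSI:", group_b.mean(), "per-sample spread:", group_b.std())

# HET-style analysis instead keeps per-sample PSI and compares the two groups
# of per-sample values directly, which is more robust in cases like group_a.
```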
This ended up being a really long answer, but I hope it helps clarify the control of false positives.
Y.