Hi Chris,
I’m terribly sorry. I wrote a detailed response to this a while ago and thought it was posted…but apparently my account has been having issues with Google Groups where they kept deleting the messages.
First of thanks for reading the documentation and checkin out the videos. Let me address your points:
1) Stack removal: We have not really tested Picard MarkDuplicates extensively, but in our experience it is not really suitable for MAJIQ / RNA-seq where you in fact have many “duplicate” reads that share the same start and end positions. This was the reason we originally went towards our own stack removal approach with the focus on splice junction reads.
2) Downsampling: What is discussed in the workshop video (I think) is referring to running PSI or deltaPSI over experimental groups. In these original quantification modes of MAJIQ we are assuming the experiments input are true “replicates” (like a cell line or mouse strain) which all share the same underlying PSI, and so the reads are pooled together for all the samples in the group to estimate PSI and dPSI. In your case if the replicates are run as a group those with many more reads disproportionately influence the group PSI calculation, and so we suggested downsampling in that video in such cases.
Other things you could do is calculate PSI per sample and then look at the distribution of expected PSI values for the replicates of your experiments. Conceptually, this is similar to what is done with our newer version of MAJIQ v2 in the new quantification mode “het.” For human data and other large and heterogeneous experiments, het drops many of the replicate assumptions of the original PSI and dPSI. This study was just published in Nature Communications and you can check it out here for more details https://www.nature.com/articles/s41467-023-36585-y
3) Number of samples per experiment group: Given the number of samples I would recommend you try majiq-het (see above). Because PSI is ultimately a ratio between junctions out of the same reference exon of an LSV, coverage differences and number of samples should be a huge influence on those outputs…just maybe how many are discoverable and the confidence in those estimates within each group. Related to what is discoverable in your runs and given the disease subtypes and the number of samples in the group, I would recommend you check out / think about how you want the MAJIQ builder step to run. Specifically the --min-experiments option (https://biociphers.bitbucket.io/majiq/MAJIQ.html#builder:~:text=execution.%20%5BDefault%3A%20False%5D-,%2D%2Dmin%2Dexperiments,-MIN_EXP%20Threshold%20for). If in your build config file you are running two groups (control with 23 samples and disease with 50 samples grouped together), by default MAJIQ will require the LSV to meet additional filtering criteria in 50% of your samples (i.e. 25 disease and 13 control samples). If your disease subtypes are very distinct and say a de novo junction only occurs in a rare subtype from your dataset (say subtype B is only 5 samples), you would miss these things with the default settings. You can lower this threshold to something that maybe reflects the % of the disease group of your smallest subtype, run the subtypes as different experimental groups, or just change it to a float value (for % of the group) or integer value (for absolute # of samples in group) you think is reasonable.
Hopefully this helps and gives you some more ideas / clarifications. Let me know if you have more questions or if anything is unclear!
Best,
Matt Gazzara