MAJIQ stack removal and downsampling...

166 views
Skip to first unread message

Chris Khoury

unread,
Feb 20, 2023, 5:35:17 PM2/20/23
to majiq_voila
Hi MAJIQ team,

We have 3 questions on implications for one of our RNA-sequencing data sets with PCR duplicates, uneven coverage and an unbalanced number of experiment samples. The single data set we are referring to below is on RNA samples purified from plasma. These are legacy experiments we are trialling splicing detection on in the hope we can define the correct pipeline parameters for our future experiments, and had a few questions.


1) Stack Removal
The current elephant in the room is RNA PCR duplicates.

We currently have used Picard MarkDuplicates to identify PCR sequencing duplicates and optical clusters.

Historically in these files, the duplicated reads have been removed from gene expression analysis.

Our idea is to run both types of files through; those with PCR duplicates tagged & removed, and then, those with PCR duplicates only tagged, in order to gain an idea of overlap across output.

Would your recommendation be to not remove PCR duplicates and rely on your stack removal, seeing as Picard MarkDuplicates is quite "dumb" in its removal algorithm compared to MAJIQ's stack removal approach?


2) Downsampling
We have inconsistent coverage numbers across these samples. After watching the 2017 MAJIQ analysis workshops on the BioCiphersLab youtube channel, we noted Dr Barash mentioned definitely downsampling of samples for a more uniform coverage. Whilst most of our samples fall in the first SD, a small number of replicates are up to 4 SD's different.

Should we downsample and does the above influence MAJIQ?


3) Number of samples per experiment group

At present we have 2 conditions. The disease group is double the size of the control (n=50, n=23) and sequencing coverage follows this pattern as well.

Will this influence MAJIQ's output? Should we consider separating the disease group into sub-groups (which we have the ability to do) and only consider this output?

Thanks in advance for your suggestions.



Regards,
Chris
Message has been deleted
Message has been deleted
Message has been deleted
Message has been deleted
Message has been deleted

Matthew Gazzara

unread,
Mar 9, 2023, 5:51:23 PM3/9/23
to Biociphers

Hi Chris,


I’m terribly sorry. I wrote a detailed response to this a while ago and thought it was posted…but apparently my account has been having issues with Google Groups where they kept deleting the messages. 


First of thanks for reading the documentation and checkin out the videos. Let me address your points: 


1) Stack removal: We have not really tested Picard MarkDuplicates extensively, but in our experience it is not really suitable for MAJIQ / RNA-seq where you in fact have many “duplicate” reads that share the same start and end positions. This was the reason we originally went towards our own stack removal approach with the focus on splice junction reads. 


2) Downsampling: What is discussed in the workshop video (I think) is referring to running PSI or deltaPSI over experimental groups. In these original quantification modes of MAJIQ we are assuming the experiments input are true “replicates” (like a cell line or mouse strain) which all share the same underlying PSI, and so the reads are pooled together for all the samples in the group to estimate PSI and dPSI. In your case if the replicates are run as a group those with many more reads disproportionately influence the group PSI calculation, and so we suggested downsampling in that video in such cases. 


Other things you could do is calculate PSI per sample and then look at the distribution of expected PSI values for the replicates of your experiments. Conceptually, this is similar to what is done with our newer version of MAJIQ v2 in the new quantification mode “het.” For human data and other large and heterogeneous experiments, het drops many of the replicate assumptions of the original PSI and dPSI. This study was just published in Nature Communications and you can check it out here for more details https://www.nature.com/articles/s41467-023-36585-y 


3) Number of samples per experiment group: Given the number of samples I would recommend you try majiq-het (see above). Because PSI is ultimately a ratio between junctions out of the same reference exon of an LSV, coverage differences and number of samples should be a huge influence on those outputs…just maybe how many are discoverable and the confidence in those estimates within each group. Related to what is discoverable in your runs and given the disease subtypes and the number of samples in the group, I would recommend you check out / think about how you want the MAJIQ builder step to run. Specifically the --min-experiments option (https://biociphers.bitbucket.io/majiq/MAJIQ.html#builder:~:text=execution.%20%5BDefault%3A%20False%5D-,%2D%2Dmin%2Dexperiments,-MIN_EXP%20Threshold%20for). If in your build config file you are running two groups (control with 23 samples and disease with 50 samples grouped together), by default MAJIQ will require the LSV to meet additional filtering criteria in 50% of your samples (i.e. 25 disease and 13 control samples). If your disease subtypes are very distinct and say a de novo junction only occurs in a rare subtype from your dataset (say subtype B is only 5 samples), you would miss these things with the default settings. You can lower this threshold to something that maybe reflects the % of the disease group of your smallest subtype, run the subtypes as different experimental groups, or just change it to a float value (for % of the group) or integer value (for absolute # of samples in group) you think is reasonable. 


Hopefully this helps and gives you some more ideas / clarifications. Let me know if you have more questions or if anything is unclear!


Best,

Matt Gazzara

Chris Khoury

unread,
Jul 21, 2023, 4:59:46 AM7/21/23
to Biociphers
Hi Matt and Majiq team, Hope you are well.

Thank you for the response Matt -  Sorry for the delay (life with a newborn!) .

Would you be able to illustrate how MAJIQ detects and removes the read stacks (--markstacks), the dispersion issue you observed over splice junctions which meant you incorporated bootstrapping, plus what to consider when setting the p-value threshold to flag any stacks (the default setting p-val ≤ 10−7 was quoted as being conservative in the elife 2016 Vaquero-Garcia et al paper. For us biologists, we could do with an example :) . Should I adjust the default setting if I know my wet lab involves a high PCR amplification cycle? i.e do you have any recommendations for different PCR cycle settings.

Many thanks,
Chris.

Reply all
Reply to author
Forward
0 new messages