majiq update queries | documentation | heterogen vs deltapsi

Oliver Ziff

unread,

Mar 11, 2022, 8:22:55 AM3/11/22

to majiq_voila

Hi there,

Thank you for the great updates in the latest version of MAJIQ. I am trying to get to grips with the latest version updates and have a few questions:

Which documentation should we be following? https://biociphers.bitbucket.io/majiq/ or https://biociphers.bitbucket.io/majiq-docs-academic/ The latter seems more up to date but incomplete.

heterogen vs deltapsi

When should we use heterogen and when deltapsi? It seems deltapsi should be used for replicates from a single experiment and heterogen for large heterogenous populations? The heterogen example seems to be on ENCODE replicates from a single experiment?

heterogen voila tsv output

I am not clear what a few columns in the Voila TSV output refer to, specifically:
- {grp1,grp2}_num_quantified
- tnom_score

The E(dPSI) value has been omitted. It is suggested to use the E[PSI] difference as a measure of effect size but it is not included in the output. https://biociphers.bitbucket.io/majiq-docs/getting-started-guide/statistics.html Should this be calculated simply as grp1_median_psi - grp2_median_psi?

Have the test statistics been adjusted for repeat testing?

deltapsi voila tsv output

I saw in a forum post that probability_changing does not represent p-values and instead posterior probabilities. I know we can set dPSI thresholds e.g. 10% or 20% but what significance threshold is recommended on probability_changing?

Although the Voila TSV outputs have been tidied up (thank you!), there are still a few issues for those of us that use R:

"-" in the column names

";" within observations

"True", "False" not recognised as logical - need to be in CAPS

To address these issues I made an R function built around tidyverse that might be helpful for others to import and tidy the voila.tsv output:

read_voila_het <- function(voila.tsv, grp1 = NULL, grp2 = NULL){
if(file.exists(voila.tsv) == FALSE){ stop("MAJIQ voila.tsv file not found") }
read_tsv({{voila.tsv}}, skip = 15) %>% clean_names() %>%
separate_rows(c("tnom", "ttest","wilcoxon","tnom_quantile","ttest_quantile","wilcoxon_quantile","tnom_score" ,"changing","nonchanging", paste0({{grp1}},"_median_psi"), paste0({{grp2}},"_median_psi"), paste0({{grp1}},"_percentile25_psi"), paste0({{grp2}},"_percentile25_psi"), paste0({{grp1}},"_percentile75_psi"), paste0({{grp2}},"_percentile75_psi"), "de_novo_junctions","junctions_coords"), sep = ";", convert = TRUE) %>% #,"exons_coords"
mutate(gene_id = gsub("gene:","",gene_id), lsv_id = gsub("gene:","",lsv_id),
tnom_padj = p.adjust(tnom), ttest_padj = p.adjust(ttest), wilcoxon_padj = p.adjust(wilcoxon),
deltapsi = !!sym(paste0({{grp1}},"_median_psi")) - !!sym(paste0({{grp2}},"_median_psi")), changing_dpsi0.2 = as.logical(changing), changing_dpsi0.1 = case_when(tnom < 0.05 & abs(deltapsi) > 0.1 ~ TRUE, TRUE ~ FALSE)) %>%
arrange(-abs(deltapsi)) %>% select(gene_name, gene_id,lsv_id, deltapsi, tnom, ttest, wilcoxon, ends_with("_padj"), ends_with("_median_psi"), changing_dpsi0.2, changing_dpsi0.1, everything())
}

voila view

Why do some groups not display violin plots for LSVs when there are reads visible in the splicegraph for that group?

In the splicegraphs, what do the numbers above the splicing junctions represent? Presumably, they are some form of a normalised splice junction read count?

Finally, I wondered if MAJIQ could be used to quantify splicing in total RNA libraries ± Ribo-Zero RNAseq experiments e.g. with TruSeq Stranded total RNA? i.e. not undergone oligoDT enrichment for polyA mRNA?

Many thanks!
Oliver

Paul Jewell

unread,

Mar 14, 2022, 10:55:10 AM3/14/22

to majiq_voila

Hello Oliver,

- Please use https://biociphers.bitbucket.io/majiq/ - We will look to add more examples to the gallery and update the builder tool, but in general it should be complete otherwise. If there are specific sections you found confusing, please let us know and I'd be happy to look into making them more clear.

-Het vs dpsi. I'll direct someone with more finesse to elaborate on the high level use case for het vs dpsi, but in general the HET run in the documentation example is not from a single experiment, unless our definition of experiment differs: "We selected RNA-seq experiments corresponding to SRSF4 knockdown (by CRISPR) vs no-target controls in K562 and HepG2 cells from ENCODE".

- "num_quantified" -- the number of samples in that group for which the LSV was quantified (psi > 0 / exists). Note that build groups (in the config file for build) may be specified differently from HET groups (which are given in the -grp1/-grp2 arguments to 'majiq het')

- TNOM: see here, page 29 specifically. https://bioinformatics.mdanderson.org/Software/OOMPA/Current/ClassComparison-manual.pdf

voila view: Existence of reads does not guarantee statistical existence of a quantified LSV. The numbers for HET in voila view represent median raw reads.

I'm going to request a few additional comments here from a few other members of my group with a little more experience in some of the other questions. But please let me know if any of these responses are still not clear in the meantime.

Matthew Gazzara

unread,

Mar 14, 2022, 1:10:26 PM3/14/22

to majiq_voila

Hi Oliver,

Thanks for your interest and questions. Hopefully I can add on a bit to what Paul explained above:

heterogen vs deltapsi

You're correct that deltapsi is intended for experimental group comparisons where the samples that make up the groups are replicates (e.g. wild-type versus knockdown cell lines) while heterogen is for experimental groups with larger and/or more heterogenous samples (e.g. GTEx human tissues). You're right that the ENCODE experiment in heterogen example may also make sense to run deltapsi on as these should be replicate cell lines. One reason it could make sense to run heterogen in the vignette example is that there are two distinct sets of control and CRISPR KO cells: one released in 2017 and one in 2020. We may expect these experimental batches to behave slightly differently. We used het in this way when running all ENCODE knockdowns in the lab's recent paper on batch correction on MAJIQ data using our tool Moccasin (https://www.nature.com/articles/s41467-021-23608-9).

MAJIQ on total RNA libraries

Yes MAJIQ would work on ribo minus RNA-seq experiments. We expect that polyA+ libraries are cleaner / more powerful to detect splicing events, given the same number of uniquely mapped reads, because MAJIQ mostly relies on junction-spanning reads for quantification. A total RNA library would likely have more intronic reads from pre-mRNAs that are not yet fully processed. To detect and quantify intron retention, however, MAJIQ looks for read coverage across the introns in bins. For this reason, a total RNA-seq library will likely detect more intron retention events (IR), compared to a polyA selected library. You can raise the --min-intronic-cov parameter at the build stage if you're detecting more IR than you might expect. We haven't tested this in total RNA libraries to give a recommendation on a specific value to set for this parameter. If your two conditions are both total RNA libraries then this bias may not be a big concern, although differences in splicing kinetics between experimental groups could complicate things. I would also advise against comparing presence/absence of intron retention events or inclusion quantification values of introns if you are interested in comparing a total RNA library condition versus a polyA+ library condition.

All the best,

Matt Gazzara

Oliver Ziff

unread,

Mar 29, 2022, 8:58:22 AM3/29/22

to majiq_voila

Thank you very much Paul and Matthew! That is really helpful.

There are a few remaining queries:

heterogen voila tsv output

The E(dPSI) value has been omitted. It is suggested to use the E[PSI] difference as a measure of effect size but it is not included in the output. https://biociphers.bitbucket.io/majiq-docs/getting-started-guide/statistics.html Should this be calculated simply as grp1_median_psi - grp2_median_psi? Plotting this delta PSI against the -log10 of the test statistics (tnom, wilcoxon and ttest) produces a pyramid or diamond shape (I was expecting a volcano shape, which is what I see with majiq deltapsi), - i.e. many LSVs with very small delta PSI and tiny p-values (and conversely many with large delta PSIs but large p-values).

Have the test statistics been adjusted for repeat testing?

deltapsi voila tsv output

I saw in a forum post that probability_changing does not represent p-values and instead posterior probabilities. I know we can set dPSI thresholds e.g. 10% or 20% but what significance threshold is recommended on probability_changing?

The median raw read counts shown in voila view would be extremely helpful to have in deltapsi voila tsv output. How can I add these to the tsv?

Many thanks for all your help with this,

Oliver

Paul Jewell

unread,

Apr 1, 2022, 2:24:21 PM4/1/22

to majiq_voila

Hello Oliver,

There are a few queries here I will need to redirect to some of my other lab members, but to answer the ones I can:

For voila tsv on PSI or DELTAPSI inputs, you may include read counts with the non-default option --show-read-counts . However, I've generally been advised against 'trusting' this value for some types of analysis, which is why it is only a hidden / debugging option at this point. I'm going to get clarification and post back here with it. for voila tsv on HET inputs, the --show-read-counts option does not exist, however, I will look into porting it over shortly into a minor-release update. (the reason was mainly that these options were only tested for one internal use case anyway, but I have no problems making it consistent)

Paul Jewell

unread,

Apr 7, 2022, 11:56:30 AM4/7/22

to majiq_voila

Hello Oliver,

I just wanted to copy one thing a lab member, David Wang, pointed about with part of your post:

I feel like his second point is a bug. If you -log10(posterior probs changing) you will see the pyramid. Assuming "test statistic" here means p-value, the het plot should be the one that has the v shape. I think he got het and deltapsi backwards. A low p-value should correspond to higher fold change so if you use het and plot (psi1-psi2) on the x axis and -log10(p) on the y, you get the volcano shape (i.e. V shape). A higher posterior changing is better for deltapsi so the shape is opposite (i.e. inverted V, so pyramid shape).

Do you think it's possible the plot was made with the reversion implied?

Oliver Ziff

unread,

Apr 7, 2022, 1:18:59 PM4/7/22

to majiq_voila

Thanks Paul but no. I am plotting the output of heterogen as y-axis: -log10(tnom p-value) and x-axis: psi1-psi2.

"If you -log10(posterior probs changing) you will see the pyramid. " This is not the case with majiq deltapsi output - e.g. fig 1a in https://www.nature.com/articles/s41586-022-04436-3/figures/1. I see the same U shape with deltapsi output plotting y-axis: -log10(probability_changing) and x-axis: (mean_dpsi_per_lsv_junction).

majiq update queries | documentation | heterogen vs deltapsi | tsv output

Oliver Ziff

Paul Jewell

Matthew Gazzara

Oliver Ziff

Paul Jewell

Paul Jewell

Oliver Ziff