Moccasin doubts

Alessandra Longo

unread,

Jun 26, 2024, 5:06:29 AM6/26/24

to Biociphers

Hi MAJIQ development team,

This is my first time analyzing RNA-Seq data for alternative splicing, so I apologize in advance if the answer to my questions may be obvious
I have some doubts regarding the application of Moccasin for correction of .majiq build files
In my specific case, I would like to correct the data exclusively for unknown confounding factors (trying for k=2 and k=3), so in the model matrix I only reported the file name and group (0 for control samples and 1 for treated ones). However running the command with -U flag and using the output for Majiq Quantifier and Voila, I got strange results, with PSI and deltaPSI values not consistent with the number of reads reported on splicegraphs
This is the command for Moccasin, should I have reported other arguments for correction?

Moccasin Code k=2

!python moccasin.py -U 2 /Users/Alessandra/Desktop/model_matrix.tsv /Users/Alessandra/Desktop/build /Users/Alessandra/Desktop/NewDelta2

Another doubt is about the splicegraph file itself: does the file need to be modified somehow after Moccasin correction or do I use the original one created at the beginning by Majiq Builder?

Many thanks in advance,

Alessandra

bsl...@seas.upenn.edu

unread,

Jul 3, 2024, 1:01:38 PM7/3/24

to Biociphers

Hi Alessandra,

Thank you for raising this question. Today, as you said, Moccasin adjusts the read rate data in .majiq files. The adjusted read rates are used by the Quantifier to calculate PSI. However, Moccasin does not update the splicegraph, which still uses the original (non-adjusted) read counts. Voila therefore uses the original read counts. In the future we plan to update Voila to additionally display read rate. This will reflect adjustments made by Moccasin if Moccasin was used.

I am curious if your results are otherwise unexpected. One thought is that you might include an "intercept" column (value 1 for each sample) in the model matrix. Does a low-dimensional clustering (e.g. two PCA components) with the non-adjusted vs adjusted sample data reveal surprising changes?

Barry

Alessandra Longo

unread,

Jul 4, 2024, 8:35:21 AM7/4/24

to Biociphers

Hi Barry,
thank you very much for your quick reply,
considering that Voila refers to the original and uncorrected read rates,could you recommend me an alternative to Voila to handle and display the corrected PSI and dPSI results?
Despite the incorrect view of the results with Voila, for what concern the outcome of the correction itself, by adding the intercept in the Model Matrix the controls and treated samples appear to be well separated . However looking at Moccasin's vignette I have an additional doubt: for the type of correction I want to perform should I use only the -U flag or also the -F option?

Kind regards,
Alessandra

bsl...@seas.upenn.edu

unread,

Jul 5, 2024, 3:48:08 PM7/5/24

to Biociphers

Hi Alessandra,

Today, I have no recommended alternative to Voila. In the future we plan to update Voila to additionally display read rate. This will reflect adjustments made by Moccasin if Moccasin was used. To be clear, Voila already does display adjusted PSI and dPSI, together with the original (non-adjusted) read counts.

If you really wanted to dig into it-- the .majiq files themselves include the read rates and genomic coordinate information for each LSV in each RNA-seq experiment. You could extract this data and transform it to use with another visualization tool such as IGV or Matplotlib. We don't formally support extracting from internals in this way, but it does contain the data you're interested in, so let me know if you want to go this route and I can share more details.

For your use-case, it should not make a difference whether you use the -F flag or not. The -F flag tells Moccasin to use only the named confounding factor columns and "intercept" column (if included by the user, with "intercept" spelled that way) from the model matrix for linear modeling. In other words, the -F flag will cause Moccasin to ignore all other (non-confounder / covariate) columns from the model matrix. But since you aren't including additional covariate columns, it should not make a difference for you.

Barry

Alessandra Longo

unread,

Jul 8, 2024, 5:12:56 AM7/8/24

to Biociphers

Hi Barry

Thank you very much for your suggestions, everything is much clearer to me now.
Since I would like to graphically represent the most significant results obtained could I ask you more details on how to proceed having the correct .majiq files and related deltaPSI.tsv available?

Best regards,
Alessandra

bsl...@seas.upenn.edu

unread,

Jul 8, 2024, 12:22:54 PM7/8/24

to Biociphers

Hi Alessandra,

The .majiq files are numpy files which were written using "savez" (see: https://numpy.org/devdocs/user/how-to-io.html#binary ). They can be opened in python using numpy.load. The key 'lsv_types' gives a numpy array with the LSV id and description of each variation (junctions/introns). The key ‘junc_info’ gives a numpy array with one row per junction/intron; each row gives the LSV id (f0), and the junction/intron start (f1) and end (f2) coordinate. The ‘coverage’ key is the array of bootstrap read rates. Specifically, each row is a splice junction/intron (same order as the junc_info rows) and each column is a bootstrap replicate index (default 30 for MAJIQ). The row i, column j entry is the read rate calculated for the i’th junction and j’th bootstrap. The MAJIQ Quantifier considers variance in the bootstrap read rates (i.e., across columns) to model the posterior PSI / dPSI distributions. The PSI and DeltaPSI tsv outputs also include the LSV id and junction/intron start-end coordinates, so you could join the read rates with corresponding DeltaPSI results using the LSV id and junc start-end as keys.

My disclaimer is that these are all MAJIQ internals and will not remain consistent in future MAJIQ versions.

Barry

Reply all

Reply to author

Forward