MAJIQ output

Miriam Martínez

May 18, 2023, 9:15:53 AM

to Biociphers

Dear Biociphers group,

I have recently started working in the splicing analysis field and have been using MAJIQ to analyze our lab data, but unfortunately I am unable to fully understand the results that MAJIQ outputs, so I am contacting you for help with them. I would also like to use MOCCASIN afterwards, but I need to understand the MAJIQ output first. To give some context, I have run a deltaPSI analysis between 2 groups with 15-16 samples each. I have checked the documentation and videos on the web, but the resources I have found seem to be for an older version, and when I check the v2 guide (v2 is the version I used), the output elements it describes don't resemble the files I have obtained, for example the columns of the summary file or the names of the folders and files. Maybe there is a recently updated guide, but I haven't been able to find it.

Another question that arises when checking the output files is the difference between the information in the deltapsi file and in the summary file. I don't know how the results in the output files should be interpreted. I am unable to find where the deltaPSI information is, or whether the changes found are significant or not. If you could please guide me through the information in the MAJIQ output files, I would be really grateful. Again, thank you very much for your time, and I hope my questions have been clear enough. If any question is unclear, please let me know and I will try to explain it another way.

Best regards,

Miriam Martínez

San Jewell

May 18, 2023, 11:18:03 AM

to Biociphers

Hello Miriam,

I apologize for the documentation possibly being difficult to find; we have had some issues with our web hosting service over the past week, and some pages might have been backdated. Please first look at this documentation site, both the main guide and the other pages, to see if they help: https://biociphers.bitbucket.io/majiq-docs-academic/gallery/heterogen-vignette.html

As for the specific questions, can you let me know exactly which pipeline you are running? For example, are you comparing the results of $ voila tsv with those of $ voila modulize? Which specific columns would you like me to elaborate on?

By default, both tsv and modulize only report significantly changing events. See, for example, $ voila modulize --help for a comprehensive explanation of all of the filters used to mark something as significantly changing.

Please let me know if this helps.

-San

Miriam Martínez

May 19, 2023, 5:27:58 AM

to Biociphers

Hello San,

Yesterday I wrote an answer but it doesn't appear, so I am writing it again. I apologize if it shows up twice.

First of all, thank you for the guide link; it was very helpful. As you point out in your email, I was comparing the output files of voila tsv and voila modulize, but reading the guide I realized they provide different information. If I understood correctly, voila tsv provides information about the splicing events, while voila modulize provides information about the type of event. Please correct me if I'm wrong.

Secondly, I saw that the example in the guide used voila heterogen for the analysis. I used deltaPSI because I wanted to run the splicing analysis between the groups I described in my first email. If heterogen can also compare two groups, would you recommend that I stay with deltaPSI or switch to heterogen? I've seen that the latter runs several statistical tests.

Regarding my specific questions, the command I used to generate the tsv was voila tsv -j 4 -f splicing/MAJIQ/voila_output.tsv splicing/MAJIQ. If possible, I would like to ask you to elaborate on the following columns:

- probability_changing

- probability_non_changing

I would like to understand which values I should take into account to check whether a splicing event occurs in one group or the other. I assume that the dPSI values for each LSV are in the mean_dpsi_per_lsv_junction column, but I don't fully understand how I should interpret the columns above. As you state in your last email, the default mode for tsv and modulize only looks for significantly changing events, but is there any value in the analysis that gives the statistical significance of each event? If not, should it be calculated separately with a particular statistical test?

Again, thank you very much for your time, and I apologize if these are very elementary questions.

Best,

Miriam Martínez

San Jewell

May 23, 2023, 12:43:03 PM

to Biociphers

Hi Miriam,

I too have noticed weirdness with Google Groups from time to time. I would recommend saving your message before hitting post, in case you need to write up a longer reply again.

> If I understood correctly, voila tsv provides info about the splicing events, while voila modulize provides info about the type of event, please correct me if I'm wrong.

There is quite a lot of overlap between the two. voila tsv provides a good one-file, high-level overview of all of the detected LSVs and, by default, as you mention, sticks to changing events unless --show-all is used. Modulize is much more comprehensive. It has many more filtering options and, as you mention, it groups everything by event type inside modules. You get a summary of the distribution of event types and also a breakdown / quantification per event type. This mode also allows you to load any number of voila files to compare all of the columns side by side.

> would you recommend to keep on with the deltaPSI option or would you recommend me to use the heterogen option?

When comparing between just two groups, I think deltaPSI is probably still your best option. Both use statistical tests, but deltaPSI is geared more towards a slightly legacy use case with only two cases being compared, which sounds more in line with your usage. We are in the process of developing another tutorial, similar to the one I linked to previously, that is tailored to deltaPSI instead of heterogen.

> Regarding my specific questions, the command I used to generate the tsv was voila tsv -j 4 -f splicing/MAJIQ/voila_output.tsv splicing/MAJIQ. If possible, I would like to ask you to elaborate on the following columns:
>
> - probability_changing
>
> - probability_non_changing

The calculation here is difficult to summarize, but I can point you to the code if you'd like to see our method. Basically, it is a matrix area function that estimates P(-t < dPSI < t) for some threshold t > 0, which is specified by the threshold argument.
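
To illustrate the idea only (this is a sketch, not MAJIQ's actual code): assume the dPSI posterior is available as a list of probability masses over equal-width bins spanning [-1, 1]. The function name `prob_within` and the example distribution below are hypothetical. The function sums the mass of the bins whose centers fall inside (-t, t), which approximates P(-t < dPSI < t); one minus that value is the probability that |dPSI| is at least t.

```python
def prob_within(bins, t):
    """Approximate P(-t < dPSI < t) from binned posterior mass over [-1, 1]."""
    n = len(bins)
    width = 2.0 / n            # equal-width bins covering [-1, 1] (assumption)
    total = sum(bins)          # normalize in case the masses do not sum to 1
    inside = 0.0
    for i, mass in enumerate(bins):
        center = -1.0 + (i + 0.5) * width
        if -t < center < t:
            inside += mass
    return inside / total


# Hypothetical posterior with 10 bins; most of the mass sits near dPSI = +0.3.
posterior = [0.0, 0.0, 0.0, 0.0, 0.05, 0.15, 0.5, 0.25, 0.05, 0.0]

p_non_changing = prob_within(posterior, t=0.2)
p_changing = 1.0 - p_non_changing
print(round(p_non_changing, 3), round(p_changing, 3))
```

With a coarse grid like this the result depends on how the bin edges line up with t, so a real implementation would use many more bins.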

> I would like to understand which values should I take into account to check if there is a splicing event occurring in one group or another. I assume that the values of dpsi for each lsv are in the mean_dpsi_per_lsv_junction column, but I don't fully understand how I should consider the columns above.

As you would expect, a high mean dPSI goes with a higher probability_changing, and a very low dPSI goes with a higher probability_non_changing. The main difference is that the mean difference over groups is a simpler calculation (difference in beta priors), whereas the probability columns are probably a better basis for conclusions, as they are stricter in general. For example, if you are looking for highly changing events, I would filter your table on probability_changing >= threshold, rather than just looking for mean_dpsi >= threshold. I am myself a bit fuzzy on the details of this use case as I was not the original designer, so I've also asked my labmates for help in responding.
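
A minimal sketch of that kind of filter, using only the Python standard library. Several things here are assumptions rather than confirmed details of the voila tsv format: that per-junction values inside a cell are separated by ";", that metadata lines before the header start with "#", and that the ID column is called lsv_id. The function name `changing_lsvs` and the two example rows are made up for illustration.

```python
import csv


def changing_lsvs(tsv_text, prob_cutoff, sep=";"):
    """Return the lsv_ids where any junction has probability_changing >= prob_cutoff."""
    # Drop comment lines that may precede the header (assumption about the file layout).
    lines = [ln for ln in tsv_text.splitlines() if not ln.startswith("#")]
    reader = csv.DictReader(lines, delimiter="\t")
    kept = []
    for row in reader:
        # One probability per junction, separated by `sep` (assumed delimiter).
        probs = [float(p) for p in row["probability_changing"].split(sep)]
        if max(probs) >= prob_cutoff:
            kept.append(row["lsv_id"])
    return kept


# Made-up rows, only to show the shape of the input.
example = (
    "lsv_id\tmean_dpsi_per_lsv_junction\tprobability_changing\n"
    "geneA:s:100-200\t0.31;-0.31\t0.97;0.96\n"
    "geneB:t:300-400\t0.22;-0.22\t0.55;0.52\n"
)
print(changing_lsvs(example, prob_cutoff=0.95))
```

Both example rows would pass a plain mean dPSI cutoff of 0.2, but only the first passes the probability cutoff, which is the sense in which the probability filter is stricter.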

> is there any value in the analysis that gives the statistical significance of each event? If not, should it be calculated separately with a particular statistical test?

At the moment, there is only significantly changing and significantly non-changing, but not a combined version. What kind of significance did you have in mind that might help with your analysis? I can pass it back and get some ideas rolling around if it might make sense for a future version.

Always feel free to ask, that's why we set up the board!

Thanks,

-San

Miriam Martínez

Jun 16, 2023, 6:53:44 AM

to Biociphers

Hi San,

Sorry for the belated reply. I had to leave the splicing analysis aside and couldn't continue with it until now.

First of all, thank you for the explanation above and your time. Checking my files and the guides again, some questions arise; I hope they are not repetitive. Starting from a deltapsi analysis, does voila tsv "filter" the file obtained from majiq deltapsi using specific thresholds? I ask because the results from deltapsi are a big list, but after using voila tsv this list shrinks considerably. Secondly, since voila modulize has many more filtering options, as you mention above, is it normal that it shows fewer events? (At least in my files, there are fewer events in this file than in the one generated by voila tsv.) Therefore, if the events shown are different, which events from which file (majiq deltapsi output, voila tsv output or voila modulize output) should really be considered for further analysis? I hope my question is clear; if not, please let me know.

About the code to see your method, that would be great!

Regarding the changing events, I am not looking for big differences, as my field of study is mental disorders and I don't expect very big differences. I was using the mean dpsi to check for differences because I'm using a multi-tool approach and the other tools use the mean dpsi to check the splicing events, so I wanted a common metric for the filter I use (apart from other tool-specific filters where necessary). MAJIQ adds the probability of changing, so I don't really know how I should filter to be comparable with the other tools I have used (maybe using both the probability and the mean dpsi would be a good idea, but I would like to know your opinion as the developers of the tool).

Lastly, about the significance: I am not a statistician, so I don't have deep knowledge of the possible analyses that should be done with this type of data. I mainly asked because, for example, rMATS gives a p-value for each event, which indicates whether the identified event is statistically significant or not. With this in mind, I asked whether there was any value that gave the significance like this (my first guess was the probability of changing or non-changing values, but I don't know whether they can be used to state the statistical significance of each event).

Thank you very much for your time and efforts. I hope the messages and questions are clear. I will be looking forward to your response!

Miriam

San Jewell

Jun 20, 2023, 11:42:12 AM

to Biociphers

Hi Miriam,

Thanks for the reply, and no worries about delays.

> First of all, thank you for the explanation above and your time. Checking my files and the guides again, some questions arise; I hope they are not repetitive. Starting from a deltapsi analysis, does voila tsv "filter" the file obtained from majiq deltapsi using specific thresholds? I ask because the results from deltapsi are a big list, but after using voila tsv this list shrinks considerably. Secondly, since voila modulize has many more filtering options, as you mention above, is it normal that it shows fewer events? (At least in my files, there are fewer events in this file than in the one generated by voila tsv.) Therefore, if the events shown are different, which events from which file (majiq deltapsi output, voila tsv output or voila modulize output) should really be considered for further analysis? I hope my question is clear; if not, please let me know.

The majiq deltapsi pipeline, which dumps a majiq.tsv file, lists _all_ LSVs that deltapsi records, without any filtering beyond what is specified for $ majiq deltapsi and $ majiq build. The voila programs allow finer tuning of the filtering in terms of probability thresholds and deltapsi/psi thresholds. In addition to these filters, $ voila tsv by default only dumps _changing_ LSVs. I believe this last point is why most of them are being removed, as an LSV must pass both the dpsi and significance criteria to be included in the output. You can use --show-all to skip this filter.

> About the code to see your method, that would be great!

Sure, here is the link to the function:

https://bitbucket.org/biociphers/majiq_academic/src/2cae15073a068bffa0bb627e7717629c59cf1b92/voila/rna_voila/vlsv.py#lines-55

(where the input "matrix" is the binned values of the beta prior function of the dpsi)

> Regarding the changing events, I am not looking for big differences, as my field of study is mental disorders and I don't expect very big differences. I was using the mean dpsi to check for differences because I'm using a multi-tool approach and the other tools use the mean dpsi to check the splicing events, so I wanted a common metric for the filter I use (apart from other tool-specific filters where necessary). MAJIQ adds the probability of changing, so I don't really know how I should filter to be comparable with the other tools I have used (maybe using both the probability and the mean dpsi would be a good idea, but I would like to know your opinion as the developers of the tool).
>
> Lastly, about the significance: I am not a statistician, so I don't have deep knowledge of the possible analyses that should be done with this type of data. I mainly asked because, for example, rMATS gives a p-value for each event, which indicates whether the identified event is statistically significant or not. With this in mind, I asked whether there was any value that gave the significance like this (my first guess was the probability of changing or non-changing values, but I don't know whether they can be used to state the statistical significance of each event).

I'm also not a statistician, so I'm going to ask again for someone else to follow up here. However, I think we generally came to the conclusion that depending on dpsi alone wasn't enough and that we needed to check for significance, and that the probability values provided in these cases can be used as a P-value. If the other tools you are testing do not support significance testing, are you planning to write something to accomplish it in post-processing?

Thanks,

-San

wan...@biociphers.org

Jun 20, 2023, 12:29:39 PM

to Biociphers

Hi Miriam,

Please try running your analysis again using "majiq heterogen" instead of "majiq deltapsi". Deltapsi is for analysis involving replicates. Heterogen is for cohort comparisons and will give you output with p-values.

Thanks,

David

Miriam Martínez

Jun 21, 2023, 9:09:04 AM

to Biociphers

Hi San and David,

Firstly, thank you very much, San, for all the explanations in this and earlier emails; they clarify a lot. Also, thank you for the code reference. Regarding the statistical part, you asked: "If the other tools you are testing do not support significance testing, are you planning to write something to accomplish it in post-processing?" When I use other tools (I haven't used many so far, as I'm only just starting in alternative splicing), the output includes the psi values for each individual in each group. From there I can check whether the groups of samples satisfy certain assumptions (normal distribution, variance, etc.) and then, based on those checks, select which statistical test is appropriate for my data. That is what I have done until now to check whether the dPSIs I obtained are significant or not.

Lastly, thank you for your suggestion David, I will run my analysis with majiq heterogen then and check the results from there.

I hope you have a nice day!

Best,

Miriam

Miriam Martínez

Jun 30, 2023, 7:45:38 AM

to Biociphers

Hi again,

After running my experiment with heterogen, I now get the statistics I wanted; thank you very much for the solution. Now that I have output with the statistical tests, I tried the voila tsv parameter --changing-between-group-dpsi to check events at different dpsi thresholds, with --changing-between-group-dpsi 0.1 and --changing-between-group-dpsi 0.05. Doing this (without the --show-all parameter), I get more than double the events with 0.1 than with 0.05, and I don't understand why since, as I understand it, the 0.05 threshold should be more permissive than the 0.1 threshold. I want 2 files: one with events that have at least 0.1 dpsi difference (10%) and another with events that have at least 0.05 dpsi difference (5%). Thank you very much in advance!

Best,

Miriam

San Jewell

Jul 3, 2023, 10:40:05 AM

to Biociphers

Hi Miriam,

I've investigated this issue and found a slight problem with the row filtering in voila tsv mode with het inputs. Basically, the only row filtering is done with --probability-threshold, and not with --changing-between-group-dpsi; I believe that after the lab mostly switched to using the modulizer with HET, the filtering was not properly ported over to voila tsv mode. I will open a discussion to make sure that the code changes for the next patch make this clearer and functional.

In the meantime, what I would recommend is to _not_ specify --probability-threshold, and instead run your scripts twice, once with --changing-between-group-dpsi 0.05 and once with --changing-between-group-dpsi 0.1; this should give both outputs with the same number of rows. From here you can filter on the text "True" appearing in the column "changing". This filter should keep more rows in the 0.05 file than in the 0.1 file, so you can get what you want. Let me know if this would be possible.

Thanks!

-San

Miriam Martínez

Jul 4, 2023, 4:36:55 AM

to Biociphers

Hi San,

Thanks for checking in such detail; I hope it isn't a big issue and that you can fix it without much trouble. Regarding your suggestion, doing exactly that is what gave me the filtering problem, because I didn't change the --probability-threshold parameter; it was left at its default. The commands I used were the following (identical except for the value of --changing-between-group-dpsi and the output file name):

For dpsi > 0.1: voila tsv -j 4 -f /path_to_output/voila_dpsi10.tsv --changing-between-group-dpsi 0.1 /path_to_splicegraph/splicegraph.sql /path_to_voila_file/Affected-Unaffected.het.voila

For dpsi > 0.05: voila tsv -j 4 -f /path_to_output/voila_dpsi5.tsv --changing-between-group-dpsi 0.05 /path_to_splicegraph/splicegraph.sql /path_to_voila_file/Affected-Unaffected.het.voila

Therefore, maybe I have misunderstood how to use the parameters correctly to get the output I want. I would really appreciate it if you could confirm that I used them correctly, or tell me whether there is another parameter I should change. Thank you in advance!

Best,

Miriam

San Jewell

Jul 5, 2023, 9:41:09 AM

to Biociphers

Hi Miriam, what I was suggesting was basically a Calc/Excel method of getting what you need as a stopgap. Spreadsheet programs that open TSV files should be able to filter rows on "column contains some text". In this case, when you change the --changing-between-group-dpsi parameter, it will change the values in the column "changing". If you filter on "True" being in that column, I believe it will provide what you want.

(Note: here there is a comma-separated True/False value for each junction, but a common logic is that the LSV is considered changing if _any_ junction is changing, so just looking for "column contains True" should be good enough for your purposes.)
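
The same "column contains True" filter can also be scripted instead of done in a spreadsheet. The sketch below uses only the Python standard library; the function name `rows_marked_changing` and the example rows are hypothetical, and it assumes the first line passed in is the tab-separated header containing a "changing" column as described above.

```python
import csv


def rows_marked_changing(tsv_lines):
    """Keep rows whose "changing" cell contains "True" for at least one junction."""
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    return [row for row in reader if "True" in row["changing"]]


# Made-up rows: one comma-separated True/False value per junction.
example = [
    "lsv_id\tchanging",
    "lsv_1\tFalse,True",
    "lsv_2\tFalse,False",
    "lsv_3\tTrue,True",
]
kept = [row["lsv_id"] for row in rows_marked_changing(example)]
print(kept)  # ['lsv_1', 'lsv_3']
```

For a real file, an open file handle can be passed in place of the list, for example `with open(path, newline="") as fh: rows = rows_marked_changing(fh)`; any lines that precede the header would need to be skipped first.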

Miriam Martínez

Jul 6, 2023, 4:38:27 AM

to Biociphers

Hi San, thanks for the clarification. However, I have the same problem as stated before.

Regarding this statement: "In the meantime, what I would recommend is to _not_ specify --probability-threshold, and instead you should run your scripts, one with --changing-between-group-dpsi 0.05 and one with --changing-between-group-dpsi 0.1; this should give both outputs with the same number of rows."

I actually did this, as I wrote in my last message, and I don't get the same number of rows even though I didn't change the probability threshold. Therefore, I can't filter as you suggest, because the TSVs are different. There are more rows when I use --changing-between-group-dpsi 0.1 than when I use 0.05 (which I understand to be more permissive). As this happens without changing the --probability-threshold parameter, I don't understand how voila is actually filtering, so I am not sure whether I can trust those TSVs. Would a solution in the meantime be to use the --show-all parameter and then apply the filtering you suggest? Or is there another approach I'm not aware of?