Error with rMATS --paired-stats model output

557 views
Skip to first unread message

Matthew Romero

unread,
Jul 28, 2020, 11:22:36 PM7/28/20
to rMATS User Group
Hello, I was able to successfully run rMATS turbo on two samples from a differentiated and progenitor rna-seq data (FASTQ files). When running these samples I did not include the --paired-stats model parameter. My first question is that does this mean the output from this does not reflect the difference in splicing between the two samples, but rather with the default --cstat cutoff of 0.01%?

Next, I re-ran rMATS two include 2 other replicates for each the differentiated and progenitor samples, specifying so in the s1.txt and s2.txt variables, and including the --paired-stats model for this run. 
My second question is can you specify the replicates and pairs for the paired end model when using FASTQ files? Or does it have to be for BAM files only?
I ask this because the program ran for sometime before outputting the following files, which is different from the non --paired-stats run which had additional files for A3SS, A5SS, SE, MXE, RI of .MATS.JC and .MATS.JCEC.  Is this the correct output for the rMATS --paired-stats model or did my run terminate prematurely?


Eric Kutschera

unread,
Jul 29, 2020, 10:15:31 AM7/29/20
to rMATS User Group
Without --paired-stats, the default unpaired statistical model will be used to compare the two samples. The default statistical model calculates a PValue and FDR where the PValue is the estimated probability that the true difference between the inclusion levels of the two samples is less than --cstat. So the default statistical model is comparing the splicing between the two samples. It's using --cstat to essentially say that any splicing difference between the two samples of less than 0.01% is not meaningful. The paired stats model does not use --cstat and instead considers any difference in splicing to be potentially meaningful. Both statistical models are comparing the two samples

You can specify the replicates that are paired for --paired-stats when using either FASTQ or BAM inputs. You just need to make sure that the first replicate in s1.txt is supposed to be paired with the first replicate in s2.txt and similarly for the second, third, ... replicates in s1.txt and s2.txt. The output files should include .MATS.JC and .MATS.JCEC regardless of whether --paired-stats is given as part of the command. The files in your screenshot are produced in the earlier parts of the rMATS execution. Those files are then used as input to the statistical model (either default or paired). Since the .MATS files are missing, it seems that the statistical model did not complete successfully. The paired model is much slower so maybe it was still running when you took the screenshot. If there was an error it should have been printed to stderr

Eric

Matthew Romero

unread,
Jul 29, 2020, 2:19:30 PM7/29/20
to rMATS User Group
Thank you for your reply, I understand the answer to my first question now.

In regards to my second, the program was submitted to an HPC cluster and the job was terminated - leaving these as the only output files, so I believe some error occurred as the job was not running anymore at the time I obtained this screenshot. I was not able to find/see the error printed in stderr as my script directs error to a specific file, so I am unclear on what went wrong and where to look for this error. I have attached below the error file that I have been able to find. I was hoping you could advise me on where to look for the standard error file and/or how to address this issue, as the job seems to have stopped before finishing the statistical model. Thank you in advance.

Matthew Romero

unread,
Jul 29, 2020, 2:22:24 PM7/29/20
to rMATS User Group
I cannot see my screenshot attached so I will attach the txt file below.
Error_file.txt

Thomas Danhorn

unread,
Jul 29, 2020, 2:34:27 PM7/29/20
to Matthew Romero, rMATS User Group
The error message say that `Rscript' count not be found. This an
executable of your R installation (being called from within the Python
script), and should be in your PATH. Do you need to load some module to
have access to R (I am assuming you have R installed)?

On a side note, if a job uses too much time or too many resources (e.g.
RAM), a cluster's scheduler might kill it, and you might not see anything
more specific than "job terminated", although there may be information
that tells you the total of each resource used. If this happnes, it is
not a problem of the program you ran, but of the cluster's configuration
and available resources.

Hope that helps,

Thomas


On Wed, 29 Jul 2020, Matthew Romero wrote:

> I cannot see my screenshot attached so I will attach the txt file below.
>
> On Wednesday, July 29, 2020 at 11:19:30 AM UTC-7, Matthew Romero wrote:
>>
>> Thank you for your reply, I understand the answer to my first question now.
>>
>> In regards to my second, the program was submitted to an HPC cluster and
>> the job was terminated - leaving these as the only output files, so I
>> believe some error occurred as the job was not running anymore at the time
>> I obtained this screenshot. I was not able to find/see the error printed in
>> stderr as my script directs error to a specific file, so I am unclear on
>> what went wrong and where to look for this error. I have attached below the
>> error file that I have been able to find. I was hoping you could advise me
>> on where to look for the standard error file and/or how to address this
>> issue, as the job seems to have stopped before finishing the statistical
>> model. Thank you in advance.
>>
>>
> --
> You received this message because you are subscribed to the Google Groups "rMATS User Group" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to rmats-user-gro...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/rmats-user-group/194b78f5-9c48-4211-9dd1-44707a2c850eo%40googlegroups.com.
>

Matthew Romero

unread,
Jul 29, 2020, 2:48:30 PM7/29/20
to rMATS User Group
I believe the resources requested were sufficient for the job, but I will double check that for the next submission. 

In regards to R, is this software used in the non-paired stats model? As this error was not encountered when I ran the earlier job. 
> To unsubscribe from this group and stop receiving emails from it, send an email to rmats-us...@googlegroups.com.

Eric Kutschera

unread,
Jul 29, 2020, 2:53:20 PM7/29/20
to rMATS User Group
The paired model is https://github.com/Xinglab/PAIRADISE which is implemented in R. The default non-paired stats model is implemented in C

To use --paired-stats, in addition to having R installed, you'll also need to have PAIRADISE installed. Here's how the ./build_rmats script installs PAIRADISE: https://github.com/Xinglab/rmats-turbo/blob/master/build_rmats#L65

It looks like you have rmats installed in a conda environment. With that conda environment activated you could try the following:
git clone https://github.com/Xinglab/rmats-turbo.git
cd rmats-turbo
git clone https://github.com/Xinglab/PAIRADISE.git
conda install -c conda-forge r-base==3.6.2
Rscript install_r_deps.R

Eric

Matthew Romero

unread,
Jul 29, 2020, 3:11:29 PM7/29/20
to rMATS User Group
That makes sense, I did not have the proper dependencies installed for the paired model. Thank you for the detailed response, I will install them right now.

I have the .rmats file from the failed run in my temp directory currently, can I re-run from the post step using this .rmats file after having the paired model dependencies installed? Or will I have to restart this run?
If so, would I have to move this file and the bam files to another directory and update the --b1 and --b2 txt files for this run?

Eric Kutschera

unread,
Jul 29, 2020, 3:28:25 PM7/29/20
to rMATS User Group
You should be able to re-run with --task post. You can use the same --tmp directory as before since it has the .rmats file. You don't need to move the .rmats file to a new directory (but you could if you wanted to). The --b1 and --b2 txt files should not be changed since the paths to the bam files from --b1 and --b2 are used as strings to access the data in the .rmats file (the bam files themselves won't be accessed in the post step).

Eric

Matthew Romero

unread,
Jul 29, 2020, 5:20:28 PM7/29/20
to rMATS User Group
I had run rMATS  previously for the default statistical model in the base environment, which worked, because when activating the rmats environment in conda_envs, I keep getting the error

  File "/home/mrr014/anaconda3/bin/rmats.py", line 16, in <module>
    from rmatspipeline import run_pipe
ModuleNotFoundError: No module named 'rmatspipeline'

However, when following the commands you provided for installing the dependencies for the paired-stats model I could only successfully do so within the rmats environment, but it brings me back to this earlier error of PYTHONPATH not being able to locate rmatspipeline even after I have added it to PYTHONPATH in my .bashrc file. I was wondering if you had any advice on how to address this issue?

Eric Kutschera

unread,
Jul 30, 2020, 8:26:52 AM7/30/20
to rMATS User Group
It sounds like you have rMATS installed in the base conda environment but PAIRADISE installed in a different conda environment (maybe with different Python versions which could cause the ModuleNotFoundError)

You need an environment with both rMATS and PAIRADISE installed. You could try installing both in a fresh conda environment, or you could try installing rMATS to the environment that already has PAIRADISE with: conda install -c conda-forge -c bioconda rmats=4.1.0

Eric

Pedro Barbosa

unread,
Jul 30, 2020, 7:01:31 PM7/30/20
to rMATS User Group
Hi,

I'l like to ask a subsequent question on this topic. I'm just testing this new option in rMATS (--paired-stats) for a paired dataset (3 replicates each group) I have recently analysed.
PAIRADISE is taking too long to run. Is is usually that slow ?? I mean, it's running for 24h in a fairly good cluster node, and It didn't even finish to process SE events (last line of pairadise status file: ExonID 42177.[28% completed]). Is there any option to speed up this process ?

Best,
pedro

Eric Kutschera

unread,
Jul 31, 2020, 8:29:57 AM7/31/20
to rMATS User Group
PAIRADISE is known to be slow and there is an open issue on GitHub: https://github.com/Xinglab/rmats-turbo/issues/5

One thing that can help is setting --tstat which is the number of threads for the statistical model. It is separate from --nthread which only applies to the event detection and counting steps of rMATS. At the time of the 4.1.0 release there was an unnecessary limit of 8 for --tstat. That limit has since been removed:
https://github.com/Xinglab/rmats-turbo/commit/5de46978a45654e05541920457520c794672a93f
https://github.com/Xinglab/PAIRADISE/pull/3/files

Eric

Matthew Romero

unread,
Jul 31, 2020, 11:45:10 AM7/31/20
to rMATS User Group
Hello Eric, you were correct, I had rmats installed in the base environment and not the rmats, installing it once the environment was activated solved my problems and the program is now running.

And to follow up on Pedro's point, for my task using the paired stats statistical model on 3 replicates for 2 samples, running just the post task, I submitted it for 12 hours with 1node and 6 cores which still was not enough time. Is the way to potentially speed up this process to increase -tstat to match the number of cores I request on the node if the default value is 1? Also, I submit rMATS as a batch script to the cluster but cannot seem to locate a file that tells me how far along this process is, so I am unsure of how much more time to give this process relative to the initial 12 hours - any ideas on where this type of output file might be? Or ways to specify within the input for rMATS as to obtain one? Thanks again for all your help.

Eric Kutschera

unread,
Jul 31, 2020, 12:46:56 PM7/31/20
to rMATS User Group
Yes, you can potentially speed up this process by increasing --tstat to match the number of cores you request on the node. It may make sense for me to change the code so that if --tstat is not given then it will use the value of --nthread. Hopefully that will prevent some confusion since users will only have to worry about one parameter related to the number of threads

The pairadise_status.txt file gets written to the working directory that was being used when rmats was started. The working directory depends on how your cluster operates and what parameters you used to submit the job. Right now there is no way to tell rMATS where to write that file. It may make sense to change the code so that the file gets written to either the --od or --tmp directory

Eric

Pedro Barbosa

unread,
Jul 31, 2020, 2:42:21 PM7/31/20
to rMATS User Group
Thanks for the quick reply. Indeed, i was not using --tstat option, which seems to have some positive effect. Still, it is slow and I hope PAIRADISE team improves the package on that regard. Meanwhile, I need to make sure if it's worth the wait, otherwise I'll just use rMATS normally, even if the dataset is paired.

Best,
Pedro

Pedro Barbosa

unread,
Jul 31, 2020, 2:43:34 PM7/31/20
to rMATS User Group
"Worth the wait" - I meant rMATs running time, not waiting for improvements in the package :)

Matthew Romero

unread,
Aug 6, 2020, 1:16:19 PM8/6/20
to rMATS User Group
Hello again, I have been able to run the paired stats model on the 2 samples with 3 replicates from start to finish with all files successfully output- however, I have some irregularities in my data. I have been examining events with FDR values for Skipped Exon output  (SE.JCEC.MATS) with FDR < 0.05, in order to compare with output from other collaborators with my lab. 

The column  "SJC_SAMPLE_1" appears to have a large majority of values "0,0,0" for this range of FDR, whereas the other similar columns seem much more balanced/normal. I was wondering if there was potentially some known cause for why this particular column would be artificially skewed? Or if this indicates that something went wrong with the run?

I posted this here because it continued on the same run/dataset discussed in this thread, and did not think it was worth beginning a new thread to discuss it. Any advice would be greatly appreciated, thanks!

Matthew Romero

unread,
Aug 11, 2020, 2:07:32 PM8/11/20
to rMATS User Group
Hello, just posting here again because I have not heard back yet.

I consulted a different thread where all columns for IJC and SJC were 0's, which suggested that maybe if the data was stranded and the wrong strand given this would cause an error. However, the data I am using is not stranded so this should not be the problem. Second, it suggested if the annotations were different format than the BAM files then the mismatch would result in no counts. I don't see this as the case since the other three columns have values for their counts for each of the three replicates. 

The only irregularity I see with my pattern is that occasionally when SJC_SAMPLE_1 is not "0,0,0" then IJC_SAMPLE_1 will be "0,0,0". But for the most part, the SJC_SAMPLE_1 column consists of zeros for my best data points and I am still wondering what I can do to remedy this? 

Thomas Danhorn

unread,
Aug 11, 2020, 4:35:23 PM8/11/20
to Matthew Romero, rMATS User Group
Can you take some junction(s) and see manually what the count(s) should be
(e.g. count reads in IGV or with something like bedtools or "samtools
view -c my.bam chr:start-end")? Then you can compare that to what rMATS
gets ... If there is an unexoected discrepancy, it might be worthwhile
filing a bug report.

Good luck!

Thomas
> To unsubscribe from this group and stop receiving emails from it, send an email to rmats-user-gro...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/rmats-user-group/29d2b082-c2fc-4d83-a156-e6f45818b5c4o%40googlegroups.com.
>
Reply all
Reply to author
Forward
0 new messages