Comparing large number of samples

Ömer An

unread,

Jun 8, 2021, 2:39:45 AM6/8/21

to rMATS User Group

Hi,

Thanks for the great tool. I have 3 questions related:

1. Does it make sense to use rMATS to compare hundreds of samples? For example, I compare 374 tumor samples as the test vs 50 normal samples as the control from TCGA LIHC. Can I expect any significant events in the output?

2. For the above comparison, alignment finishes successfully as it is sequential. However, calculation step takes forever with very high usage of memory (> 125 GB). Either I have to kill the script or it goes to state "D" and gets stuck there, hence I never get results. Any suggestion how to overcome this issue?

3. The summary output is as below. What does negative numbers mean? Does the summary statistics overall look ok?

Thanks a lot.

gtf: 7.330986976623535
There are 28207 distinct gene ID in the gtf file
There are 74263 distinct transcript ID in the gtf file
There are 14973 one-transcript genes in the gtf file
There are 752460 exons in the gtf file
There are 5281 one-exon transcripts in the gtf file
There are 4134 one-transcript genes with only one exon in the transcript
Average number of transcripts per gene is 2.632786
Average number of exons per transcript is 10.132367
Average number of exons per transcript excluding one-exon tx is 10.831507
Average number of gene per geneGroup is 41.116693
statistic: 0.018664121627807617

read outcome totals across all BAMs
USED: -1111027985
NOT_PAIRED: 0
NOT_NH_1: -1751379762
NOT_EXPECTED_CIGAR: 143890078
NOT_EXPECTED_READ_LENGTH: 0
NOT_EXPECTED_STRAND: 0
EXON_NOT_MATCHED_TO_ANNOTATION: 26468822
JUNCTION_NOT_MATCHED_TO_ANNOTATION: 18109667
CLIPPED: 0
total: 1621028116
outcomes by BAM written to: rMATS/temp/2021-06-03-13:28:59_099530_read_outcomes_by_bam.txt

novel: 37883.14090299606

kutsc...@gmail.com

unread,

Jun 8, 2021, 9:25:07 AM6/8/21

to rMATS User Group

1. Comparing a large number of samples makes sense. With a large number of samples in each group rMATS should be better able to estimate the variation within each group. I would expect there to still be plenty of significant events in the output. But there could be situations where some subsets of the samples in a group have a low variance for certain events, but there is a high variance when looking at all the samples for that group. In those cases, the events may (correctly) not be marked significant with all samples, even though it was significant for a subset of samples

2. The last line of your output is: 'novel: 37883.14090299606'. Since it didn't print a line like: 'save: {time}' it seems that rMATS is stuck writing the .rmats files. That shouldn't take a long time unless there is some issue with the filesystem (that could be the case if you see state "D"). It could be that rMATS is actually past that part, but nothing was printed yet because the output is buffered. You can get rid of buffering by running with 'python -u'. The only reasons I would expect high memory usage is if there are a large number of threads, or --novelSS is used. --novelSS would use more memory with more samples since there are potentially more reads supporting novel splice sites

3. The negative numbers are due to integer overflow. Only the printed summary should be incorrect. The output files should all have correct information as long as each individual bam file did not exceed ~2 billion alignments. You should see more useful numbers in: rMATS/temp/2021-06-03-13:28:59_099530_read_outcomes_by_bam.txt. Here's a related post: https://github.com/Xinglab/rmats-turbo/issues/123#issuecomment-847048886

Eric

Thomas Danhorn

unread,

Jun 8, 2021, 9:40:46 AM6/8/21

to kutsc...@gmail.com, rMATS User Group

I am confused about the 32-bit integer overflow -- any system used for
bioinformatics should be 64-bit, and Python 3 does not have a "maxint" and
the limitations of Python 2. I can calculate with numbers much larger
than 2^64 (let alone 2^32 - 1) and print them:

$ python3 <<< "import sys; print('{}'.format(sys.maxsize * 1000))"
9223372036854775807000

Is there an artificial limitation in the rMATS code that could be removed?

Thanks,

Thomas

> --
> You received this message because you are subscribed to the Google Groups "rMATS User Group" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to rmats-user-gro...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/rmats-user-group/325b53f2-3295-4902-976f-d7e8effea944n%40googlegroups.com.
>

kutsc...@gmail.com

unread,

Jun 8, 2021, 10:10:29 AM6/8/21

to rMATS User Group

Apologies if this is a duplicate. I think I replied once directly to Thomas instead of to this group

Yes, the 32 bit limit is artificial. Here is a link to the code: https://github.com/Xinglab/rmats-turbo/blob/v4.1.1/rMATS_pipeline/rmatspipeline/rmatspipeline.pyx#L841

Part of rMATS is written in Cython which is compiled as C++. Plain "int" is used instead of something like "int64_t". For the most part this is not an issue since rMATS is counting things at the event or gene level and usually with a separate count per bam file. As far as I know the only issue is when things are aggregated to generate the "read outcome" summary which is just for debugging. Even in that case, I haven't heard of an issue with the more detailed {date}_read_outcomes_by_bam.txt file, so this issue is not a high priority for me right now. All the usages of "int" in the code could probably just be replaced with a 64 bit int without issue, but that change would need to be tested

Eric

Reply all

Reply to author

Forward