v4.1.1 uses many fewer reads than v4.1.0

216 views
Skip to first unread message

Rupert Hugh-White

unread,
Mar 23, 2021, 6:16:44 PM3/23/21
to rMATS User Group
Hi all,

I'm finding very different results when using v4.1.1 vs v4.1.0.

Specifically, many fewer reads are reported as 'USED' in the 'read_outcome' log information, as well as many fewer events reported. The number of reads 'USED' decreases by an order of magnitude (these reads are instead reported as 'EXON_NOT_MATCHED_TO_ANNOTATION' or 'JUNCTION_NOT_MATCHED_TO_ANNOTATION'). Full commands and logs below.

In both cases I am using docker images built from the Xing lab provided Dockerfiles (e.g. this one). The input data (BAM file), GTF, and command used are the same in both cases.

Any ideas what may be happening here? I'm presuming this is not the expected behaviour for v4.1.1 vs v4.1.0?

Many thanks!

v4.1.0 command:

python /rmats/rmats.py     --b1 input_tmp.txt     --gtf "gencode.v24lift37.annotation.gtf"     --od ./.     --tmp ./.     --nthread 2     --libType  "fr-secondstrand"     -t "paired"     --readLength 100     --variable-read-length     --allow-clipping     --statoff     --novelSS
v4.1.0 log:
gtf: 22.8722820282
There are 60308 distinct gene ID in the gtf file
There are 198865 distinct transcript ID in the gtf file
There are 38478 one-transcript genes in the gtf file
There are 1179775 exons in the gtf file
There are 26480 one-exon transcripts in the gtf file
There are 24341 one-transcript genes with only one exon in the transcript
Average number of transcripts per gene is 3.297490
Average number of exons per transcript is 5.932542
Average number of exons per transcript excluding one-exon tx is 6.690228
Average number of gene per geneGroup is 7.609373
statistic: 0.0323858261108

read outcome totals across all BAMs
USED: 82626882
NOT_PAIRED: 0
NOT_NH_1: 11535082
NOT_EXPECTED_CIGAR: 1730942
NOT_EXPECTED_READ_LENGTH: 0
NOT_EXPECTED_STRAND: 0
EXON_NOT_MATCHED_TO_ANNOTATION: 3319752
JUNCTION_NOT_MATCHED_TO_ANNOTATION: 69990
CLIPPED: 0
total: 99282648
outcomes by BAM written to: ././2021-03-23-02:50:24_603943_read_outcomes_by_bam.txt

novel: 379.324129105
The splicing graph and candidate read have been saved into ././2021-03-23-02:50:24_603943_*.rmats
save: 19.8733680248
loadsg: 0.453102827072

==========
Done processing each gene from dictionary to compile AS events
Found 56119 exon skipping events
Found 4164 exon MX events
Found 28878 alt SS events
There are 17286 alt 3 SS events and 11592 alt 5 SS events.
Found 9092 RI events
==========

ase: 4.40710306168
count: 16.5859689713
Processing count files.
Done processing count files.

v4.1.1 command:

python /rmats/rmats.py     --b1 input_tmp.txt     --gtf "gencode.v24lift37.annotation.gtf"     --od ./.     --tmp ./.     --nthread 2     --libType  "fr-secondstrand"     -t "paired"     --readLength 100     --variable-read-length     --allow-clipping     --statoff     --novelSS

v4.1.1 log:

gtf: 999.722041845
There are 60308 distinct gene ID in the gtf file
There are 198865 distinct transcript ID in the gtf file
There are 38478 one-transcript genes in the gtf file
There are 1179775 exons in the gtf file
There are 26480 one-exon transcripts in the gtf file
There are 24341 one-transcript genes with only one exon in the transcript
Average number of transcripts per gene is 3.297490
Average number of exons per transcript is 5.932542
Average number of exons per transcript excluding one-exon tx is 6.690228
Average number of gene per geneGroup is 7.609373
statistic: 0.0308849811554

read outcome totals across all BAMs
USED: 6452182
NOT_PAIRED: 0
NOT_NH_1: 11535082
NOT_EXPECTED_CIGAR: 1730942
NOT_EXPECTED_READ_LENGTH: 0
NOT_EXPECTED_STRAND: 0
EXON_NOT_MATCHED_TO_ANNOTATION: 69162039
JUNCTION_NOT_MATCHED_TO_ANNOTATION: 10402403
CLIPPED: 0
total: 99282648
outcomes by BAM written to: ././2021-03-23-18:35:38_930608_read_outcomes_by_bam.txt

novel: 436.540826082
The splicing graph and candidate read have been saved into ././2021-03-23-18:35:38_930608_*.rmats
save: 0.377684116364
loadsg: 0.330173015594

==========
Done processing each gene from dictionary to compile AS events
Found 38686 exon skipping events
Found 2073 exon MX events
Found 12873 alt SS events
There are 7894 alt 3 SS events and 4979 alt 5 SS events.
Found 5834 RI events
==========

ase: 2.53935790062
count: 0.712833881378
Processing count files.
Done processing count files.

Rupert Hugh-White

unread,
Mar 23, 2021, 9:12:48 PM3/23/21
to rMATS User Group
UPDATE

This seems to be related to the  --libType  option. 
If I switch from "fr-secondstrand" to "fr-firststrand" with v4.1.1, the number of reads USED increases to a reasonable/expected value (81513611).
However, the opposite behaviour is seen with v4.1.0 - fr-firststrand gives a low read usage and fr-secondstrand a high read usage.

Has the implementation of  --libType options changed between v4.1.0 and v4.1.1?

To determine the libType option to use I ran rseqc infer_experiment.py and received the following results, from which I inferred I should use 'fr-secondstrand'. Is that correct?

This is PairEnd Data

Fraction of reads failed to determine: 0.0304

Fraction of reads explained by "1++,1--,2+-,2-+": 0.1754

Fraction of reads explained by "1+-,1-+,2++,2--": 0.7942

Thanks


kutsc...@gmail.com

unread,
Mar 24, 2021, 8:52:01 AM3/24/21
to rMATS User Group
Yes, the implementation of --libType has changed in v4.1.1. It was mentioned in the release notes as "Bug fix of strand specific read filtering": https://github.com/Xinglab/rmats-turbo/releases/tag/v4.1.1

Here's the code change: https://github.com/Xinglab/rmats-turbo/pull/86

Before that change, the strand was only checked for exon reads, not junction reads. The method for determining whether a particular alignment was consistent with an fr-firststrand library or fr-secondstrand library did not change

The link you provided for determining what to use for --libType (http://rseqc.sourceforge.net/#infer-experiment-py) says that "1++,1--,2+-,2-+" indicates that read 1 should map to the same strand as the gene and read 2 should map to the opposite strand of the gene. "1+-,1-+,2++,2--" indicates that read 1 should map to the opposite strand and read 2 should map to the same strand

The meaning of fr-firststrand and fr-secondstrand in rMATS is consistent with the description here: https://salmon.readthedocs.io/en/latest/library_type.html

fr-firststrand is the same as ISR which indicates that read 1 should map to the opposite strand of the gene and read 2 should map to the same strand as the gene. fr-secondstrand is the same as ISF which indicates that read 1 should map to the same strand as the gene and read 2 should map to the opposite strand

Based on those interpretations, it seems that "1++,1--,2+-,2-+" is the same as fr-secondstrand and that "1+-,1-+,2++,2--" is the same as fr-firststrand. Since the output on your data showed about 80% "1+-,1-+,2++,2--" then it is expected that rMATS v4.1.1 should use more reads with fr-firststrand

Eric
Reply all
Reply to author
Forward
0 new messages