v4.1.1 uses many fewer reads than v4.1.0

Rupert Hugh-White

unread,

Mar 23, 2021, 6:16:44 PM3/23/21

to rMATS User Group

Hi all,

I'm finding very different results when using v4.1.1 vs v4.1.0.

Specifically, many fewer reads are reported as 'USED' in the 'read_outcome' log information, as well as many fewer events reported. The number of reads 'USED' decreases by an order of magnitude (these reads are instead reported as 'EXON_NOT_MATCHED_TO_ANNOTATION' or 'JUNCTION_NOT_MATCHED_TO_ANNOTATION'). Full commands and logs below.

In both cases I am using docker images built from the Xing lab provided Dockerfiles (e.g. this one). The input data (BAM file), GTF, and command used are the same in both cases.

Any ideas what may be happening here? I'm presuming this is not the expected behaviour for v4.1.1 vs v4.1.0?

Many thanks!

v4.1.0 command:

python /rmats/rmats.py --b1 input_tmp.txt --gtf "gencode.v24lift37.annotation.gtf" --od ./. --tmp ./. --nthread 2 --libType "fr-secondstrand" -t "paired" --readLength 100 --variable-read-length --allow-clipping --statoff --novelSS

v4.1.0 log:

gtf: 22.8722820282

There are 60308 distinct gene ID in the gtf file

There are 198865 distinct transcript ID in the gtf file

There are 38478 one-transcript genes in the gtf file

There are 1179775 exons in the gtf file

There are 26480 one-exon transcripts in the gtf file

There are 24341 one-transcript genes with only one exon in the transcript

Average number of transcripts per gene is 3.297490

Average number of exons per transcript is 5.932542

Average number of exons per transcript excluding one-exon tx is 6.690228

Average number of gene per geneGroup is 7.609373

statistic: 0.0323858261108

read outcome totals across all BAMs

USED: 82626882

NOT_PAIRED: 0

NOT_NH_1: 11535082

NOT_EXPECTED_CIGAR: 1730942

NOT_EXPECTED_READ_LENGTH: 0

NOT_EXPECTED_STRAND: 0

EXON_NOT_MATCHED_TO_ANNOTATION: 3319752

JUNCTION_NOT_MATCHED_TO_ANNOTATION: 69990

CLIPPED: 0

total: 99282648

outcomes by BAM written to: ././2021-03-23-02:50:24_603943_read_outcomes_by_bam.txt

novel: 379.324129105

The splicing graph and candidate read have been saved into ././2021-03-23-02:50:24_603943_*.rmats

save: 19.8733680248

loadsg: 0.453102827072

==========

Done processing each gene from dictionary to compile AS events

Found 56119 exon skipping events

Found 4164 exon MX events

Found 28878 alt SS events

There are 17286 alt 3 SS events and 11592 alt 5 SS events.

Found 9092 RI events

==========

ase: 4.40710306168

count: 16.5859689713

Processing count files.

Done processing count files.

v4.1.1 command:

python /rmats/rmats.py --b1 input_tmp.txt --gtf "gencode.v24lift37.annotation.gtf" --od ./. --tmp ./. --nthread 2 --libType "fr-secondstrand" -t "paired" --readLength 100 --variable-read-length --allow-clipping --statoff --novelSS

v4.1.1 log:

gtf: 999.722041845

There are 60308 distinct gene ID in the gtf file

There are 198865 distinct transcript ID in the gtf file

There are 38478 one-transcript genes in the gtf file

There are 1179775 exons in the gtf file

There are 26480 one-exon transcripts in the gtf file

There are 24341 one-transcript genes with only one exon in the transcript

Average number of transcripts per gene is 3.297490

Average number of exons per transcript is 5.932542

Average number of exons per transcript excluding one-exon tx is 6.690228

Average number of gene per geneGroup is 7.609373

statistic: 0.0308849811554

read outcome totals across all BAMs

USED: 6452182

NOT_PAIRED: 0

NOT_NH_1: 11535082

NOT_EXPECTED_CIGAR: 1730942

NOT_EXPECTED_READ_LENGTH: 0

NOT_EXPECTED_STRAND: 0

EXON_NOT_MATCHED_TO_ANNOTATION: 69162039

JUNCTION_NOT_MATCHED_TO_ANNOTATION: 10402403

CLIPPED: 0

total: 99282648

outcomes by BAM written to: ././2021-03-23-18:35:38_930608_read_outcomes_by_bam.txt

novel: 436.540826082

The splicing graph and candidate read have been saved into ././2021-03-23-18:35:38_930608_*.rmats

save: 0.377684116364

loadsg: 0.330173015594

==========

Done processing each gene from dictionary to compile AS events

Found 38686 exon skipping events

Found 2073 exon MX events

Found 12873 alt SS events

There are 7894 alt 3 SS events and 4979 alt 5 SS events.

Found 5834 RI events

==========

ase: 2.53935790062

count: 0.712833881378

Processing count files.

Done processing count files.

Rupert Hugh-White

unread,

Mar 23, 2021, 9:12:48 PM3/23/21

to rMATS User Group

UPDATE

This seems to be related to the --libType option.

If I switch from "fr-secondstrand" to "fr-firststrand" with v4.1.1, the number of reads USED increases to a reasonable/expected value (81513611).

However, the opposite behaviour is seen with v4.1.0 - fr-firststrand gives a low read usage and fr-secondstrand a high read usage.

Has the implementation of --libType options changed between v4.1.0 and v4.1.1?

To determine the libType option to use I ran rseqc infer_experiment.py and received the following results, from which I inferred I should use 'fr-secondstrand'. Is that correct?

This is PairEnd Data

Fraction of reads failed to determine: 0.0304

Fraction of reads explained by "1++,1--,2+-,2-+": 0.1754

Fraction of reads explained by "1+-,1-+,2++,2--": 0.7942

Thanks

kutsc...@gmail.com

unread,

Mar 24, 2021, 8:52:01 AM3/24/21

to rMATS User Group

Yes, the implementation of --libType has changed in v4.1.1. It was mentioned in the release notes as "Bug fix of strand specific read filtering": https://github.com/Xinglab/rmats-turbo/releases/tag/v4.1.1

Here's the code change: https://github.com/Xinglab/rmats-turbo/pull/86

Before that change, the strand was only checked for exon reads, not junction reads. The method for determining whether a particular alignment was consistent with an fr-firststrand library or fr-secondstrand library did not change

The link you provided for determining what to use for --libType (http://rseqc.sourceforge.net/#infer-experiment-py) says that "1++,1--,2+-,2-+" indicates that read 1 should map to the same strand as the gene and read 2 should map to the opposite strand of the gene. "1+-,1-+,2++,2--" indicates that read 1 should map to the opposite strand and read 2 should map to the same strand

The meaning of fr-firststrand and fr-secondstrand in rMATS is consistent with the description here: https://salmon.readthedocs.io/en/latest/library_type.html

fr-firststrand is the same as ISR which indicates that read 1 should map to the opposite strand of the gene and read 2 should map to the same strand as the gene. fr-secondstrand is the same as ISF which indicates that read 1 should map to the same strand as the gene and read 2 should map to the opposite strand

Based on those interpretations, it seems that "1++,1--,2+-,2-+" is the same as fr-secondstrand and that "1+-,1-+,2++,2--" is the same as fr-firststrand. Since the output on your data showed about 80% "1+-,1-+,2++,2--" then it is expected that rMATS v4.1.1 should use more reads with fr-firststrand

Eric

Reply all

Reply to author

Forward