How to deal with unequal read lengths?

jiaojiao guo

Mar 26, 2019, 2:37:42 AM

to rMATS User Group

Hi,

rMATS requires all reads to have the same length, but after trimming the FASTQ files, my read lengths fall in a range. If the reads are 100-150 nt and I set the length to 120, the reads shorter than 120 will not change. Do I need to discard those reads before aligning with STAR? Thanks a lot.

Thomas Danhorn

Mar 29, 2019, 3:36:10 PM

to jiaojiao guo, rMATS User Group

The only way is to pick a length where you don't have to discard too many
reads that are shorter, but keep in mind that longer is better in terms of
junction spanning and mapping specificity. Then you trim all your reads
to that length and discard anything shorter, hoping that you don't
introduce any bias doing that ... If you have enough reads, discarding a
few percent is not a big deal. If you are really curious, you could take
all the reads mapping to a specific chromosome or region (to get a small
sample with good coverage) and then trim to different lengths (e.g.
140/120/100/80) and run rMATS with each, comparing the differences.
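
The trim-and-discard step described above can be sketched in a few lines of Python. This is only an illustration on toy data: the `Read` tuple, the function names and the example reads are assumptions made for the sketch, and a real pipeline would more likely use an existing trimming tool. For paired-end data, both mates of a pair would have to be kept or dropped together, which this sketch does not handle.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical in-memory read representation: (name, sequence, quality).
Read = Tuple[str, str, str]


def trim_to_length(reads: Iterable[Read], length: int) -> Iterator[Read]:
    """Yield reads cut to exactly `length` bases; drop anything shorter."""
    for name, seq, qual in reads:
        if len(seq) >= length:
            yield name, seq[:length], qual[:length]


def fraction_kept(read_lengths: List[int], length: int) -> float:
    """Fraction of reads that survive trimming to `length`."""
    return sum(n >= length for n in read_lengths) / len(read_lengths)


# Toy reads with lengths 150, 118, 125 and 101 (made up for illustration).
reads = [
    ("r1", "A" * 150, "I" * 150),
    ("r2", "C" * 118, "I" * 118),
    ("r3", "G" * 125, "I" * 125),
    ("r4", "T" * 101, "I" * 101),
]
lengths = [len(seq) for _, seq, _ in reads]

# Compare how many reads each candidate length would keep.
for candidate in (140, 120, 100, 80):
    print(candidate, fraction_kept(lengths, candidate))

trimmed = list(trim_to_length(reads, 120))
```

Running the loop on real length distributions shows how quickly the kept fraction drops as the target length grows, which is the trade-off against junction spanning mentioned above.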

Thomas

> --
> You received this message because you are subscribed to the Google Groups "rMATS User Group" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to rmats-user-gro...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/rmats-user-group/37da6aad-277e-4508-82ea-e981383e2a06%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>

Shihao Shen

Apr 1, 2019, 11:46:31 AM

to rMATS User Group

Hi Thomas & Jiaojiao,

rMATS-turbo 4.0.2 does not discard reads based on the length setting. However, the length setting does affect the length normalization function and the PSI value calculation.

The length normalization function uses the effective inclusion and skipping lengths, as displayed in the JC and JCEC files, to do the PSI calculation. Since the effective length depends on the read length, a wrong read length will lead to a wrong effective length.
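
To make the effect of length normalization concrete, here is a small Python sketch of the PSI formula from the rMATS paper, where inclusion and skipping counts are each divided by their effective length before taking the ratio. The function name and the example counts and lengths are made up for illustration and are not taken from any real output file.

```python
import math
from typing import Optional


def length_normalized_psi(
    inc_count: int, skip_count: int, inc_len: float, skip_len: float
) -> Optional[float]:
    """PSI = (I / l_I) / (I / l_I + S / l_S); None if there are no reads."""
    inc = inc_count / inc_len
    skip = skip_count / skip_len
    if inc + skip == 0:
        return None
    return inc / (inc + skip)


# Illustrative numbers only: 30 inclusion reads, 10 skipping reads,
# effective lengths 198 (inclusion) and 99 (skipping).
psi = length_normalized_psi(30, 10, inc_len=198, skip_len=99)

# Ignoring effective lengths (equal lengths) gives a different value.
naive = length_normalized_psi(30, 10, inc_len=1, skip_len=1)

print(psi, naive)
```

With these numbers the normalized PSI is 0.6 while the raw count ratio is 0.75, so a wrong effective length shifts PSI directly. Because PSI is a ratio, multiplying both counts by the same factor leaves it unchanged.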

We are working on a better and more precise model to account for all lengths in rMATS-ISO, but it is still in very early beta. As a temporary workaround, I suggest using the average read length after trimming the very short reads.

Shihao Shen

rMATS Developer

Scientist II, Children's Hospital of Philadelphia

Google scholar link: https://scholar.google.com/citations?user=VX7cW9cAAAAJ&hl=en

jiaojiao guo

Apr 5, 2019, 4:51:18 PM

to Thomas Danhorn, shiha...@gmail.com, rmats-us...@googlegroups.com

Hi Thomas & Shihao,

Thank you very much for your explanation. As I am very junior in this field, I still have some naive questions about the results.

I checked the sashimi plots of events filtered by FDR < 0.01 & delta ψ > 0.05 and found that:

1. Many junction read counts are very small, e.g. less than 10. Are such events worth exploring further? In your experience, what is an appropriate threshold for the junction read count?

2. Some genes, like the one in figure 258, have many reads in exons but very few reads in introns, while genes like the one in figure 259 show no enrichment in exons or introns; their read peaks are scattered across the whole region. Can I say the first one is more acceptable for downstream analysis?

3. Figure 2 has no reads that include the intron, but its total count is very low compared with the WT. Is it possible that the apparent difference in splicing is caused by expression rather than splicing? Does rMATS normalize the total read count when computing the inclusion level?

I am not sure if I made my questions clear. Thank you in advance.

Best

Jiaojiao


258_CAMSAP1_chr9_135907000_135907224_-@chr9_135881633_135881794_-@chr9_135866379_135866536_-.pdf

259_HKR1_chr19_37324208_37324265_+@chr19_37324796_37324936_+@chr19_37347190_37347196_+.pdf

2_NMRK1_chr9_75069895_75070042_-@chr9_75068996_75070042_-@chr9_75068996_75069102_-.pdf

孙海峰

Apr 10, 2019, 8:48:00 AM

to rMATS User Group

I have the same question as you.

Jane Xie

Sep 18, 2020, 7:47:47 PM

to rMATS User Group

I have the same question as you, too.

Thomas Danhorn

Sep 18, 2020, 11:24:32 PM

to Jane Xie, rMATS User Group

Not sure what that question is exactly, but if it is about dealing with
reads of unequal length, version 4.1.0 has a new option called
--variable-read-length. What I normally do is throw away very short
reads, so all remaining reads are roughly (but not exactly, due to adapter
trimming) the same length and then use the average length with the
--readLength option, in combination with --variable-read-length.
Apparently the statistics are pretty robust to changing read lengths (see
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/rmats-user-group/ZCxjlQfP9ak/nDxuT3tOAgAJ),
so I don't worry about a bit of variability.
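
As a rough illustration of combining the two options, the Python sketch below only assembles and prints a command line; it does not run rMATS. `--readLength` and `--variable-read-length` are the options discussed above. The other flags (`--b1`, `--b2`, `--gtf`, `-t`, `--od`, `--tmp`), the `rmats.py` executable name and all file paths are assumptions for the sketch and should be checked against the help output of the installed version.

```python
import shlex
from typing import List


def build_rmats_command(
    b1: str, b2: str, gtf: str, out_dir: str, tmp_dir: str, read_length: int
) -> List[str]:
    """Assemble an rMATS call that uses the average read length together
    with --variable-read-length. Flag names other than --readLength and
    --variable-read-length are assumed, not taken from this thread."""
    return [
        "rmats.py",                   # assumed entry point
        "--b1", b1,                   # text file listing group 1 BAM files
        "--b2", b2,                   # text file listing group 2 BAM files
        "--gtf", gtf,
        "-t", "paired",
        "--readLength", str(read_length),
        "--variable-read-length",
        "--od", out_dir,
        "--tmp", tmp_dir,
    ]


# Hypothetical paths; 96 stands in for the average length after filtering.
cmd = build_rmats_command(
    "b1.txt", "b2.txt", "annotation.gtf", "rmats_out", "rmats_tmp", 96
)
print(shlex.join(cmd))
```

Building the argument list first makes it easy to inspect the exact call before handing it to something like `subprocess.run`.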

If you have different data sets that were sequenced at different lengths,
you should be cautious comparing those, as any differences in methodology
(different RNA extraction kit, library build kits, sequencing chemistries,
etc.) may introduce a systematic bias that would influence your results.
If you are sure you want to do this, I would trim the longer reads to the
length of the shorter ones and proceed as above.

Hope this helps,

Thomas


Jane Xie

Sep 19, 2020, 11:30:35 AM

to rMATS User Group

Hi Thomas,

Thank you so much for your reply. Under this thread, there are two questions. The first question is about dealing with reads of unequal length. The second question is about how to interpret the results. My question is the second. The original author attached three results and asked three questions. Could you please take a look at those three results in the attachment?

Thank you very much.

Sincerely yours,

Jane

Syang

Sep 30, 2020, 5:09:09 AM

to rMATS User Group

Hi Thomas ,

I was a bit confused about the "--variable-read-length" option that is new in version 4.1.0. After reading your reply I basically understand what this option means, but I still want to make sure that the way I am processing my data is right.

I have a treatment group and a control group with three replicates each, so 6 BAM files in total. The reads are 30-101 nt, and the average read lengths of the BAM files are 95, 96, 96, 95, 96 and 95. For the "--readLength" option, I compute (95+96+96+95+96+95)/6 = 95.5. Should I set the parameter to 95.5 or to 101? I only have BAM files, not the original FASTQ files, so I cannot discard short reads and make all reads the same length.

Can you tell me what I should do? Thank you in advance.

Best

Shuai Yang

Thomas Danhorn

Sep 30, 2020, 5:26:13 PM

to Syang, rMATS User Group

Hi Shuai,

The idea behind the read length is to let rMATS calculate the probability
that a read will e.g. cross a certain junction (for shorter reads that is
lower than for longer reads), so I always use the average (in your case
95.5, although I am not sure you can use fractional numbers, so you might
have to round it). As mentioned, this value does not seem to have a large
influence on the results, so it should be fine.
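
For the numbers in this case, the averaging and rounding can be written out as below. This is a minimal sketch: it takes the plain mean of the six per-file averages (not weighted by the number of reads in each file, which is an assumption) and rounds half up in case the option only accepts integers.

```python
import math
from statistics import mean

# Per-BAM average read lengths quoted in the question above.
per_file_means = [95, 96, 96, 95, 96, 95]

# Unweighted mean across files; a read-weighted mean would need the
# read count of each file, which is not given here.
average = mean(per_file_means)

# Round half up explicitly, since Python's built-in round() rounds
# halves to the nearest even integer.
read_length = math.floor(average + 0.5)

print(average, read_length)
```

Either 95 or 96 should be fine in practice, given that the value does not seem to have a large influence on the results.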

Best,

Thomas


X L

Oct 25, 2024, 7:08:52 PM

to rMATS User Group

I do not understand why the length normalization function applies to both JC and JCEC for the PSI calculation. It seems to me that it may apply to the JCEC PSI calculation, but in the case of JC, the junction should not have any length.

kutsc...@gmail.com

Oct 28, 2024, 9:40:22 AM

to rMATS User Group

From http://dx.doi.org/10.1073/pnas.1419161111
> Given the effective lengths (i.e., the number of unique isoform-specific read positions)

The effective length is counting how many different start coordinates a read could have for the given read length where the read would cover the junction (or otherwise support that isoform). That makes sense for RNA-Seq where each fragment being sequenced has a more or less random start position and then the read continues "read length" bases from that coordinate. The diagram in the README shows the effective length calculation: https://github.com/Xinglab/rmats-turbo/tree/v4.3.0?tab=readme-ov-file#output

For example, the effective length of the skipping isoform of a skipped exon event is r - (2*a) + 1, where "r" is the read length and "a" is the required anchor length on each side of the junction. The read needs to start at least anchor length before the junction and also end at least anchor length after the junction. Starting right before the junction is one possible position (when the anchor length is 1). Starting two bases before the junction also works, and you can keep moving the start position back up to r - (2*a) times.
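
The counting argument can be checked with a short Python sketch that compares the formula against a brute-force count of valid read placements. The function names are made up for this illustration, and the sketch assumes r >= 2*a so that at least one placement exists.

```python
def skipping_effective_length(read_length: int, anchor: int) -> int:
    """Closed form from the text: r - (2*a) + 1 (assumes r >= 2*a)."""
    return read_length - 2 * anchor + 1


def count_spanning_positions(read_length: int, anchor: int) -> int:
    """Brute force: count placements where the read has at least `anchor`
    bases upstream and at least `anchor` bases downstream of the junction."""
    count = 0
    for upstream in range(read_length + 1):
        downstream = read_length - upstream
        if upstream >= anchor and downstream >= anchor:
            count += 1
    return count


# The closed form and the brute-force count agree for these inputs.
for r, a in [(100, 1), (100, 8), (95, 1), (101, 1)]:
    print(r, a, skipping_effective_length(r, a), count_spanning_positions(r, a))
```

Comparing the rows for r = 95 and r = 101 shows how the read length passed to rMATS changes the effective length, which is why it matters for the JC counts as well.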

Eric
