Size Factor Normalization

Enrique Goñi Echeverría

May 27, 2021, 6:00:55 AM

to rMATS User Group

Hey,

First, thank you very much for implementing rMATS, it's super helpful and well documented.

I have looked through the documentation, the paper, and this group without finding much on whether rMATS performs size factor normalization (that is, library size normalization). Say I am comparing two conditions, A and B, without replicates. Sample A has 100M reads and sample B has 10M reads. The number of reads falling in ASE events will then be biased by the difference in sequencing depth (library size).

I know rMATS takes the ASE lengths into account and normalizes for them, but does it also normalize for library size?

Thanks for any help about this.

Best,

Kike

kutsc...@gmail.com

May 27, 2021, 3:54:59 PM

to rMATS User Group

I'm less familiar with the details of the rMATS statistical model than with other parts of the rMATS code. My understanding is that the rMATS model does account for the effects of sequencing depth, but it does not perform size factor normalization.

Here are some quotes from this rMATS paper: http://dx.doi.org/10.1073/pnas.1419161111

> rMATS uses a hierarchical framework to simultaneously model the variability among replicates and the estimation uncertainty of isoform proportion in individual replicates. It should be noted that the estimation uncertainty of isoform proportion is a well-recognized issue in RNA-Seq analysis of alternative splicing, because the confidence level of such estimates depends on the sequencing coverage for individual splicing events

> Assuming that the inclusion read count I follows a binomial distribution with the total read count n=I+S, we have
[equation 1 from that paper]
where the binomial distribution models the estimation uncertainty of psi as influenced by the total read count n, and the proportion of reads from the exon inclusion isoform is represented by the length normalization function f(psi) that normalizes the exon inclusion level psi by the effective lengths of the isoforms.

As you mentioned, rMATS considers the lengths of the different isoforms for an alternative splicing event. rMATS also uses the total read count (n=I+S) as part of the model. If the low sequencing depth of one of the samples (10M versus 100M) results in a low total read count for an event in one of the replicates, then rMATS will account for that as "estimation uncertainty of isoform proportion in [an] individual [replicate]".

Eric

Enrique Goñi Echeverría

May 28, 2021, 4:20:22 AM

to rMATS User Group

Hey,

Thanks a lot for the clarification. I still have a question, though. I understand that rMATS considers the lengths of the different isoforms and normalizes for them, but I still do not understand how it corrects for the effects of sequencing depth. Going back to the previous example, with two samples, A (100M reads) and B (10M reads), each from one condition: if an isoform has 100 counts in sample A and 10 counts in sample B, would those counts be normalized to account for sequencing depth?

Thanks in advance for all your help,

Best,

Kike

Thomas Danhorn

May 28, 2021, 12:23:55 PM

to Enrique Goñi Echeverría, rMATS User Group

Hi Kike,

I am not a statistician, so take this with a grain of salt, but the big
difference between gene-level expression analysis and splicing analysis is
that in the first case you compare all reads in sample (group) A to all
reads in sample (group) B, whereas with splicing you compare all reads
*within* a sample that support the exon-skipping form to all reads that
support the form including the exon (in the case of SE; the other cases
are analogous). So for splicing analysis, you don't compare absolute
expression levels, but *ratios/percentages* (the PSI numbers), and while
you need to normalize the absolute expression levels for gene expression
analysis, the ratios/percentages are already normalized (the range
is always 0%-100%). Where the sequencing depth comes in for rMATS is the
*confidence* you have in a certain ratio: If the ratio between
spliceforms A and B is calculated as 1:3, you will have confidence that it
is in fact pretty close to that if you have 1000 reads supporting A and
3000 reads supporting B; but if the support is only 1 read and 3 reads,
respectively, the ratio might as well be 1:1 or 1:5, you just don't have
enough data to be sure. So events with low coverage, and the resulting
high uncertainty, are unlikely to be significant even if the apparent
difference in PSI is large, whereas well-supported events can be
significant even if the change is small.
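
The point about confidence can be illustrated with a small sketch. It uses a Wilson score interval for a binomial proportion, which is a generic textbook interval and not the rMATS model; the function name `wilson_interval` and the 95% level (z = 1.96) are assumptions made for this illustration.

```python
import math


def wilson_interval(count_a, count_b, z=1.96):
    """Approximate 95% Wilson score interval for the fraction of reads
    supporting spliceform A. Generic binomial interval, not rMATS itself."""
    n = count_a + count_b
    p = count_a / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half


# Same 1:3 ratio between spliceforms A and B, very different support.
low_depth = wilson_interval(1, 3)
high_depth = wilson_interval(1000, 3000)

print("1 vs 3 reads:       %.3f to %.3f" % low_depth)
print("1000 vs 3000 reads: %.3f to %.3f" % high_depth)
```

With 1 and 3 reads the interval is wide enough to include both a 1:1 ratio (fraction 0.5) and a 1:5 ratio (fraction of about 0.167); with 1000 and 3000 reads it stays close to 0.25.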

I hope this helps you in thinking about splicing events and the associated
statistics, even without getting into the details of the exact
methodology.

Best,

Thomas

Thomas Danhorn

May 28, 2021, 12:34:36 PM

to Enrique Goñi Echeverría, rMATS User Group

I should add that length normalization is necessary for rMATS, because the
number of reads supporting a specific splice form is proportional to the
length of that splice form. If you want your PSI values to reflect molar
ratios (rather than weight ratios), you need to normalize for length,
unless you are looking exclusively at junction-spanning reads (junctions
have no length, but I guess you need to account for the number of
junctions in each isoform). Length normalization is not really needed for
differential gene expression, although it does no harm there, since all
samples are normalized with the same factor derived from the size of the
gene.
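
A minimal sketch of the length normalization described above. It assumes the length normalization function has the form f(psi) = l_I*psi / (l_I*psi + l_S*(1 - psi)), which is my reading of the rMATS paper, where l_I and l_S are the effective lengths of the inclusion and skipping forms. The function names and the example counts and lengths are illustrative, not taken from the rMATS code.

```python
def length_normalized_psi(inc_count, skip_count, inc_len, skip_len):
    """Estimate PSI as a molar ratio: divide each count by the effective
    length of its isoform before taking the proportion."""
    inc_density = inc_count / inc_len
    skip_density = skip_count / skip_len
    return inc_density / (inc_density + skip_density)


def expected_inclusion_fraction(psi, inc_len, skip_len):
    """Assumed form of f(psi): the expected fraction of reads that come
    from the inclusion form, given PSI and the two effective lengths."""
    return inc_len * psi / (inc_len * psi + skip_len * (1.0 - psi))


# Illustrative numbers: 4 inclusion reads, 2 skipping reads, and an
# inclusion form twice as long as the skipping form.
raw_fraction = 4 / (4 + 2)
psi = length_normalized_psi(4, 2, inc_len=200, skip_len=100)
round_trip = expected_inclusion_fraction(psi, inc_len=200, skip_len=100)

print("raw read fraction (weight ratio):", raw_fraction)
print("length-normalized PSI (molar ratio):", psi)
print("f(PSI), expected read fraction:", round_trip)
```

Here two thirds of the reads support the inclusion form, but because that form is twice as long, the length-normalized PSI is 0.5. Note that sequencing depth does not appear anywhere in this calculation: multiplying both counts by the same factor leaves PSI unchanged.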

Thomas

kutsc...@gmail.com

May 28, 2021, 1:44:30 PM

to rMATS User Group

I agree with Thomas's explanation. Here's some example output from the rMATS statistical model that shows how the total counts change the p-value even though the ratio within each sample stays the same:

ID  IJC_SAMPLE_1  SJC_SAMPLE_1  IJC_SAMPLE_2  SJC_SAMPLE_2  IncFormLen  SkipFormLen  PValue
1   4             2             1             3             200         100          0.122785944253
2   4             2             10            30            200         100          0.0880542304667
3   40            20            1             3             200         100          0.0430491441119
4   40            20            10            30            200         100          7.65888041131e-05

The ratio of inclusion to skipping reads (IJC:SJC) for sample 1 is 4:2 in all rows, and the ratio for sample 2 is 1:3 in all rows. In the first row both samples have a small number of total reads and the p-value is large (about 0.123). In rows 2 and 3 one of the samples has ten times more reads, but the other still has only a few, and the p-value is not that small (about 0.088 and 0.043). In the last row both samples have more reads and the p-value is very small (about 0.00008).
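
The same qualitative trend can be reproduced with a much simpler test. The sketch below is a stand-in, not the rMATS statistical model: it runs a plain likelihood-ratio (G) test of equal inclusion proportions on the 2x2 table of counts, with a chi-square approximation on 1 degree of freedom. It ignores replicate variability and any delta-PSI cutoff that rMATS applies, so its p-values will not match the table above. The function name `g_test_p_value` is an assumption for this illustration.

```python
import math


def g_test_p_value(inc1, skip1, inc2, skip2):
    """Likelihood-ratio (G) test that two samples share the same inclusion
    proportion. Simplified stand-in for illustration, not the rMATS model."""
    observed = [[inc1, skip1], [inc2, skip2]]
    row_sums = [inc1 + skip1, inc2 + skip2]
    col_sums = [inc1 + inc2, skip1 + skip2]
    total = sum(row_sums)
    g = 0.0
    for r in range(2):
        for c in range(2):
            o = observed[r][c]
            e = row_sums[r] * col_sums[c] / total
            if o > 0:  # a zero cell contributes nothing to the statistic
                g += 2.0 * o * math.log(o / e)
    g = max(g, 0.0)  # guard against tiny negative rounding error
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(g / 2.0))


# (IJC_SAMPLE_1, SJC_SAMPLE_1, IJC_SAMPLE_2, SJC_SAMPLE_2) from the table.
rows = [(4, 2, 1, 3), (4, 2, 10, 30), (40, 20, 1, 3), (40, 20, 10, 30)]
p_values = [g_test_p_value(*row) for row in rows]

for row, p in zip(rows, p_values):
    print(row, "p = %.3g" % p)
```

Even in this simplified test, the within-sample ratios are identical in every row, yet the p-value is largest when both samples have few reads and smallest when both have many.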

Eric
