Quantile normalization?


Sanjeev Sariya

Oct 15, 2019, 12:45:39 PM
to PrediXcan/MetaXcan
Dear Predixcan team,

I used RNA-seq data; the expression values were normalized to CPM using the rma and limma R libraries. I then used these values to create a database (custom GTEx v7 pipeline) and for further analysis of gene expression prediction/validation.

I was unable to run the rnaseqnorm script from GitHub, so I switched to a CPM normalization approach. https://github.com/broadinstitute/gtex-pipeline/blob/master/qtl/src/rnaseqnorm.py

I then used the same CPM-normalized data with the GTEx (v7) brain cortex database to predict/validate gene expression. I was wondering whether my pipeline is wrong?

Thanks.

 

Alvaro Barbeira

Oct 16, 2019, 10:29:42 AM
to Sanjeev Sariya, PrediXcan/MetaXcan
Hi Sanjeev,

It is likely you would need to modify the script from the Broad's GTEx pipeline to work on PredictDBPipeline format. What error did it give you?

I remember that the original developer of PredictDB_Pipeline_GTEx_v7 wrote his own version, available in the GitHub repository: https://github.com/hakyimlab/PredictDB_Pipeline_GTEx_v7/blob/master/prepare_data/expression/normalize_expression.py

The model training pipeline will work with CPM, but quantile normalization is likely to yield better models, as some individuals exhibit consistent overexpression. You can also try other implementations, such as https://cran.r-project.org/web/packages/RNOmni/vignettes/RNOmni.html

Best,

Alvaro
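
For readers following along: one common meaning of "quantile normalization" is to force every sample onto the same reference distribution, which removes the kind of sample-wide shift mentioned above. The sketch below illustrates that idea in plain Python. It is not taken from normalize_expression.py or rnaseqnorm.py; the function name, the toy matrix, and the tie handling (ties broken by order of appearance) are assumptions made for illustration.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a genes x samples matrix given as a list of rows.

    Every sample (column) is mapped onto one reference distribution: the
    mean of the sorted columns. Ties are broken by order of appearance,
    which is a simplification.
    """
    n_genes = len(matrix)
    n_samples = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_samples)]
    sorted_cols = [sorted(col) for col in columns]
    # Reference value at each rank: average across samples.
    reference = [sum(col[r] for col in sorted_cols) / n_samples
                 for r in range(n_genes)]
    result = [[0.0] * n_samples for _ in range(n_genes)]
    for j, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = reference[rank]
    return result


# Hypothetical toy data: 4 genes (rows) x 3 samples (columns).
# The third sample is shifted upward relative to the other two.
expr = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.0, 6.0],
    [4.0, 2.0, 8.0],
]
normalized = quantile_normalize(expr)
```

After this step every column of `normalized` contains the same set of values, so a sample with uniformly higher raw expression no longer stands out.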

--
You received this message because you are subscribed to the Google Groups "PrediXcan/MetaXcan" group.
To unsubscribe from this group and stop receiving emails from it, send an email to predixcanmetax...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/predixcanmetaxcan/aafdedff-d8ea-4f5f-b881-fff11e81848c%40googlegroups.com.

Sanjeev Sariya

Oct 18, 2019, 9:51:37 AM
to PrediXcan/MetaXcan
Dear Alvaro,

Thank you for your kind reply.
The shell script requires RPKM values; I'm unsure whether I should use them (links below).

I'd like to confirm that what I'm doing is acceptable for the pipeline: I used CPM with default settings to normalize for library size, then filtered genes.
I then used the rankNorm function from the RNOmni library to apply a rank-based inverse normal transform (INT) to the matrix. I'll now use these data to create a database, or to test gene imputation using the EUR database.
I hope this is sound and defensible if and when needed.

Sanjeev
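
The workflow described above (CPM, gene filtering, then a rank-based inverse normal transform per gene) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names, the toy counts, and the filter threshold are hypothetical, the CPM here is the plain "counts divided by library size, times one million" definition rather than any particular package's default, and the transform uses the Blom offset k = 0.375 with average ranks for ties, which is assumed rather than confirmed to match what RNOmni does.

```python
from statistics import NormalDist


def cpm(counts):
    """Counts per million for a genes x samples matrix of raw counts."""
    n_samples = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [[row[j] / lib_sizes[j] * 1e6 for j in range(n_samples)]
            for row in counts]


def average_ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def rank_int(values, k=0.375):
    """Rank-based inverse normal transform with offset k (Blom: 0.375)."""
    n = len(values)
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - k) / (n - 2 * k + 1))
            for r in average_ranks(values)]


# Hypothetical toy counts: 3 genes (rows) x 4 samples (columns).
counts = [
    [10, 20, 30, 40],
    [500, 400, 300, 200],
    [0, 0, 1, 0],
]
cpm_matrix = cpm(counts)
# Hypothetical filter: keep genes with CPM >= 1 in at least half the samples.
kept = [row for row in cpm_matrix
        if sum(v >= 1 for v in row) >= len(row) / 2]
# One transform per gene (row), across samples.
transformed = [rank_int(row) for row in kept]
```

In this toy example the third gene is dropped by the filter, and each remaining row ends up with the same set of normal quantiles, ordered according to that gene's ranking across samples.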


Alvaro Barbeira

Oct 25, 2019, 10:14:56 AM
to Sanjeev Sariya, PrediXcan/MetaXcan
Hi Sanjeev,

The first link you mention (https://github.com/hakyimlab/PredictDB_Pipeline_GTEx_v7/blob/master/prepare_data/expression/normalize_v7.sh) and the actual script being run (normalize_expression.py) were adapted from the GTEx eQTL pipeline. Feel free to use them if they fit your needs, but they are by no means mandatory.

Your pipeline (CPM -> RNOmni::rankNorm) sounds good and should work just as well as the GTEx pipeline's code.

Best,

Alvaro


Sanjeev Sariya

Nov 4, 2019, 9:56:17 AM
to PrediXcan/MetaXcan
Dear Alvaro,

Thank you for your kind reply.
I ran the analyses using the aforementioned workflow. The count of genes after filtering increased (earlier 18, now 23); however, the yield remains unimpressive (average correlation).

Thank you,
Sanjeev

Alvaro Barbeira

Nov 4, 2019, 1:31:14 PM
to Sanjeev Sariya, PrediXcan/MetaXcan
Dear Sanjeev,

Thank you for the update. I'll keep RNOmni::rankNorm in mind.

I assume you refer to the prediction performance R2 (pred.perf.R2 in the database). I'm sorry to hear about the low yield. What R2 values are you getting?

Best,

Alvaro


Sanjeev Sariya

May 22, 2020, 9:46:14 AM
to PrediXcan/MetaXcan
Dear Alvaro,

I'm observing an average of ~0.06-0.11 R2.

I have a concern about the normalization. When I apply RINT (rank inverse normal transformation) to the CPM matrix, the RINT is performed on genes. The matrix has genes as rows and samples as columns, so each row gets its own RINT transformation.
Does this sound right, or is this not how it should be done?

best,
Sanjeev
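
The layout described above (genes as rows, samples as columns, one transform per row) can be contrasted with the per-sample alternative in a short Python sketch. The helper name, the offset k = 0.375, and the toy matrix are assumptions for illustration, and the helper assumes no tied values to keep it short.

```python
from statistics import NormalDist


def rank_int(values, k=0.375):
    """Rank-based inverse normal transform; assumes no ties, for brevity."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        out[i] = NormalDist().inv_cdf((rank - k) / (n - 2 * k + 1))
    return out


# Hypothetical CPM-like matrix: rows are genes, columns are samples.
genes_by_samples = [
    [1.0, 8.0, 3.0, 5.0],          # gene A, low expression
    [900.0, 100.0, 400.0, 700.0],  # gene B, high expression
]

# Per gene across samples: transform each row on its own.
per_gene = [rank_int(row) for row in genes_by_samples]

# Per sample across genes: transform each column, then transpose back
# so the result is again genes x samples.
columns = [list(col) for col in zip(*genes_by_samples)]
per_sample = [list(row) for row in zip(*(rank_int(col) for col in columns))]
```

In this toy case the per-gene version gives each gene a roughly standard-normal spread across samples, whereas the per-sample version assigns gene A the same value in every sample (it is always the lower of the two), leaving no between-sample variation to model.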

Alvaro Barbeira

May 22, 2020, 3:40:32 PM
to Sanjeev Sariya, PrediXcan/MetaXcan
Dear Sanjeev,

I'm sorry but I don't understand your question. If your rows are genes, and you rank-normalize each gene separately, then each row gets its own transformation. This sounds ok to me.

Those R2 values look good. For reference, consider the R2 from the GTEx v8 Elastic Net Whole Blood models:

> e$pred.perf.R2 %>% quantile
        0%        25%        50%        75%       100%
0.01001099 0.03208228 0.06941480 0.15809833 0.78854053


The median is about 0.07, and these values were computed on about 500 samples.

Best,

Alvaro


Sanjeev Sariya

May 22, 2020, 4:22:49 PM
to PrediXcan/MetaXcan
Dear Alvaro,

Yes, that was my question: whether it is OK to rank-inverse-normalize each gene. Thanks very much for answering. I was a little confused about whether I should perform RINT per sample across all genes, or per gene across all samples. Your reply confirms that I should stick with the latter. https://github.com/broadinstitute/gtex-pipeline/tree/master/qtl

Also, the link above is really helpful; it suggests performing RINT per gene after RPKM/TMM/FPKM normalization.
Thank you for this!

For R2:
Ah, I see. I didn't know that the R2 for 500 samples in GTEx whole blood was similar. That is somewhat dampening because it is poor, but at the same time encouraging :D as it is in line with publicly available datasets.

Really appreciate your kind replies and guidance,

Cheers,
Sanjeev
