Quantile normalization?

Sanjeev Sariya

Oct 15, 2019, 12:45:39 PM

to PrediXcan/MetaXcan

Dear Predixcan team,

I have RNA-seq data whose expression values were normalized to CPM using the rma and limma R libraries. I then used these values to create a database (custom GTEx v7 pipeline) and for further analysis of gene expression prediction/validation.

I was unable to run the rnaseqnorm script on GitHub, so I moved to a CPM normalization approach. https://github.com/broadinstitute/gtex-pipeline/blob/master/qtl/src/rnaseqnorm.py

I also used the same CPM-normalized data with the GTEx (v7) brain cortex database to predict/validate gene expression. I was wondering whether my pipeline is wrong?

Thanks.

https://support.bioconductor.org/p/77664/

Alvaro Barbeira

Oct 16, 2019, 10:29:42 AM

to Sanjeev Sariya, PrediXcan/MetaXcan

Hi Sanjeev,

It is likely you would need to modify the script from the Broad's GTEx pipeline to work on PredictDBPipeline format. What error did it give you?

I remember the original developer of PredictDB_Pipeline_GTEx_v7 did his own version, here in the github repository: https://github.com/hakyimlab/PredictDB_Pipeline_GTEx_v7/blob/master/prepare_data/expression/normalize_expression.py

The model training pipeline will work with CPM, but using quantile normalization is likely to yield better models, as some individuals exhibit consistent overexpression. You can try other implementations, such as RNOmni: https://cran.r-project.org/web/packages/RNOmni/vignettes/RNOmni.html
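
To make the term concrete, here is a minimal sketch of quantile normalization across samples in plain Python. It assumes a genes x samples matrix stored as a list of rows and breaks ties by input order, which is a simplification. It is not the code from the GTEx or PredictDB pipelines, and the name `quantile_normalize` and the toy matrix are made up for illustration.

```python
def quantile_normalize(matrix):
    """Quantile-normalize the columns (samples) of a genes x samples matrix.

    Illustrative only: ties are broken by input order rather than averaged.
    """
    n_genes = len(matrix)
    n_samples = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_samples)]
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [
        sum(col[k] for col in sorted_cols) / n_samples for k in range(n_genes)
    ]
    result = [[0.0] * n_samples for _ in range(n_genes)]
    for j, col in enumerate(columns):
        # Gene indices of this sample, ordered from lowest to highest value.
        order = sorted(range(n_genes), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = reference[rank]
    return result


# Toy matrix: 4 genes (rows) x 3 samples (columns); the third sample runs high.
example = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.0, 6.0],
    [4.0, 2.0, 8.0],
]
normalized = quantile_normalize(example)
for row in normalized:
    print(row)
```

After this step every sample shares the same set of values, so a sample that is shifted upward across all genes no longer stands out.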

Best,

Alvaro

--
You received this message because you are subscribed to the Google Groups "PrediXcan/MetaXcan" group.
To unsubscribe from this group and stop receiving emails from it, send an email to predixcanmetax...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/predixcanmetaxcan/aafdedff-d8ea-4f5f-b881-fff11e81848c%40googlegroups.com.

Sanjeev Sariya

Oct 18, 2019, 9:51:37 AM

to PrediXcan/MetaXcan

Dear Alvaro,

Thank you for your kind reply.

The shell script requires RPKM values; I'm unsure whether I should use them (link below).

https://github.com/hakyimlab/PredictDB_Pipeline_GTEx_v7/blob/master/prepare_data/expression/normalize_v7.sh

I'd like to confirm that what I'm doing is acceptable for the pipeline: I used CPM with default settings to normalize for library size, then filtered genes.

I then used the rankNorm function in the RNOmni library to perform a rank-based inverse normal transform (INT) on the matrix. I'll now use these data to create a database, or to test gene imputation using the EUR database.

I hope this is sound and defensible if and when needed.

https://old.reddit.com/r/bioinformatics/comments/cdk44b/is_anyone_familiar_with_doing_differential/etydop4/

https://old.reddit.com/r/bioinformatics/comments/3lfm64/how_to_normalize_rnaseq_samples_quantified_using/cv61y8j/

Thanks,

Sanjeev


Alvaro Barbeira

Oct 25, 2019, 10:14:56 AM

to Sanjeev Sariya, PrediXcan/MetaXcan

Hi Sanjeev,

The first link you mention (https://github.com/hakyimlab/PredictDB_Pipeline_GTEx_v7/blob/master/prepare_data/expression/normalize_v7.sh) and the actual script being run (normalize_expression.py) were adapted from the GTEx eQTL pipeline. Feel free to use them if they fit your needs, but they are by no means mandatory.

Your pipeline (CPM -> RNOmni::rankNorm) sounds good and should work just as well as the GTEx pipeline's code.
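
As a rough illustration of that CPM -> rank-based INT workflow, here is a sketch in plain Python rather than R. The helper names (`cpm`, `average_ranks`, `rank_int`) and the toy counts are hypothetical. The offset k = 0.375 (Blom) with average ranks for ties is assumed to mirror RNOmni's defaults, so check the package documentation before relying on that. The CPM here uses raw library sizes only, and no gene filtering is shown.

```python
from statistics import NormalDist


def cpm(counts):
    """Counts per million for a genes x samples matrix of raw counts."""
    n_samples = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [
        [row[j] / lib_sizes[j] * 1e6 for j in range(n_samples)]
        for row in counts
    ]


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for idx in range(pos, end + 1):
            ranks[order[idx]] = avg
        pos = end + 1
    return ranks


def rank_int(values, k=0.375):
    """Rank-based inverse normal transform of one gene across samples."""
    n = len(values)
    normal = NormalDist()
    return [normal.inv_cdf((r - k) / (n - 2 * k + 1)) for r in average_ranks(values)]


# Toy raw counts: 2 genes (rows) x 4 samples (columns); sample 3 is deeper.
counts = [
    [10, 30, 20, 40],
    [990, 970, 1980, 960],
]
cpm_matrix = cpm(counts)
transformed = [rank_int(gene) for gene in cpm_matrix]  # one transform per gene
for gene in transformed:
    print([round(x, 3) for x in gene])
```

In the toy data the first and third samples have the same CPM for the first gene, so they receive the same transformed value.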

Best,

Alvaro


Sanjeev Sariya

Nov 4, 2019, 9:56:17 AM

to PrediXcan/MetaXcan

Dear Alvaro,

Thank you for your kind reply.

I ran the analyses using the aforementioned workflow. The number of genes remaining after filtering increased (from 18 to 23); however, the yield remains unimpressive (average correlation).

Thank you,

Sanjeev


Alvaro Barbeira

Nov 4, 2019, 1:31:14 PM

to Sanjeev Sariya, PrediXcan/MetaXcan

Dear Sanjeev,

Thank you for the update. I'll keep RNOmni::rankNorm in mind.

I assume you refer to the prediction performance R2 (pred.perf.R2 in the database). I'm sorry to hear about the low yield. What R2 values are you getting?

Best,

Alvaro


Sanjeev Sariya

May 22, 2020, 9:46:14 AM

to PrediXcan/MetaXcan

Dear Alvaro,

I'm observing an average R2 of ~0.06-0.11.

I had a concern about the normalization. When I apply RINT (rank-based inverse normal transformation) to the CPM matrix, it is performed on genes. The matrix has genes as rows and samples as columns, so each row undergoes its own RINT transformation.

Does this sound right, or is this not how it should be done?

best,

Sanjeev


Alvaro Barbeira

May 22, 2020, 3:40:32 PM

to Sanjeev Sariya, PrediXcan/MetaXcan

Dear Sanjeev,

I'm sorry but I don't understand your question. If your rows are genes, and you rank-normalize each gene separately, then each row gets its own transformation. This sounds ok to me.
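
To illustrate the difference between the two directions, here is a small sketch in plain Python, assuming a genes x samples matrix with no tied values. The function names, the toy numbers, and the offset k = 0.375 are illustrative assumptions, not part of any pipeline.

```python
from statistics import NormalDist


def rint(values, k=0.375):
    """Rank-based INT of one vector; assumes no tied values for brevity."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        out[i] = NormalDist().inv_cdf((rank - k) / (n - 2 * k + 1))
    return out


def rint_per_gene(matrix):
    """Rows are genes: transform each row across its samples."""
    return [rint(row) for row in matrix]


def rint_per_sample(matrix):
    """Transform each column across genes, shown only for contrast."""
    cols = [rint(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*cols)]


# Toy matrix: 3 genes (rows) x 4 samples (columns) on very different scales.
expr = [
    [1.0, 2.0, 4.0, 8.0],
    [100.0, 400.0, 200.0, 300.0],
    [50.0, 20.0, 30.0, 40.0],
]
per_gene = rint_per_gene(expr)
per_sample = rint_per_sample(expr)
print(per_gene[0])    # varies across samples
print(per_sample[0])  # identical in every sample for this toy gene
```

In this toy case the per-gene version keeps each gene's ordering across samples, while the per-sample version gives the first gene the same value in every sample because it is always the lowest-expressed gene, leaving no between-sample variation to model.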

Those values of R2 look good. For reference, consider the R2 from GTEx v8 Elastic Net Whole Blood:

> e$pred.perf.R2 %>% quantile
        0%        25%        50%        75%       100%
0.01001099 0.03208228 0.06941480 0.15809833 0.78854053

Median is 0.07, and they were computed on about 500 samples.
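
For anyone who wants the same summary for their own model database without R, a sketch using only the Python standard library could look like this. It assumes the performance values live in a column named `pred.perf.R2` in a table called `extra`; the table name is an assumption here, so adjust it to your database. The in-memory database and its five values are made up purely so the example runs.

```python
import sqlite3
from statistics import quantiles


def r2_quartiles(conn, table="extra", column="pred.perf.R2"):
    """Five-number summary of the R2 column (table/column names are assumed)."""
    rows = conn.execute(
        f'SELECT "{column}" FROM "{table}" WHERE "{column}" IS NOT NULL'
    ).fetchall()
    values = sorted(r[0] for r in rows)
    # method="inclusive" interpolates like R's default quantile() for these cut points.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {"0%": values[0], "25%": q1, "50%": median, "75%": q3, "100%": values[-1]}


# Toy stand-in for a real model database; the values are invented.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE extra (gene TEXT, "pred.perf.R2" REAL)')
conn.executemany(
    "INSERT INTO extra VALUES (?, ?)",
    [("g1", 0.01), ("g2", 0.03), ("g3", 0.07), ("g4", 0.16), ("g5", 0.79)],
)
summary = r2_quartiles(conn)
print(summary)
```

With a real file, replace the in-memory connection with `sqlite3.connect("path/to/your_model.db")`, where the path is a placeholder.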

Best,

Alvaro


Sanjeev Sariya

May 22, 2020, 4:22:49 PM

to PrediXcan/MetaXcan

Dear Alvaro,

Yes, my question was whether I should apply the rank-based inverse normal transform to each gene. Thanks very much for answering. I was a little confused about whether to perform RINT per sample across all genes or per gene across all samples. Your reply confirms that I should stick with the latter. https://github.com/broadinstitute/gtex-pipeline/tree/master/qtl

Also, the link above is really helpful; it suggests performing RINT per gene after RPKM/TMM/FPKM normalization.

Thank you for this!

For R2:

Ah, I see. I didn't know that R2 in GTEx whole blood, with 500 samples, was similar. That is somewhat disheartening because it is low, but at the same time encouraging :D as it is in line with publicly available datasets.

Really appreciate your kind replies and guidance,

Cheers,

Sanjeev
