Fwd: Questions related to PrediXcan

Hae Kyung Im

May 13, 2024, 4:13:32 PM

to PrediXcan/MetaXcan

Hi,

Please see below for responses.

Haky

Using GTEx data as the reference transcriptome, you trained prediction models (PredictDB) for multiple tissues in your study. Each tissue in GTEx has different sample size, and I wonder if prediction models were affected by the sample size. Would a larger sample size necessarily give better quality of predicted expression?

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7693040/pdf/GEPI-44-854.pdf

Prediction performance will increase with sample size. Also the number of genes that you can predict reliably will increase with the sample size.

Here is the number of gene expression models by tissue and sample size for the fine-mapping followed by mashr effect-smoothing approach: https://lab-notes.hakyimlab.org/post/2023-03-01-gtex-sample-size-by-tissue/index.html

Is there a way for me to train prediction models by myself? I can find the trained models on PredictDB, but ideally, I would try to train models using the data I have.

Try https://github.com/hakyimlab/PredictDB-Tutorial

I imputed gene expression using some of your trained models for different tissues. I wasn’t sure about the scales of the values. Are they normalized and scaled? The following are examples for several genes:

The expression values used for training are inverse-normalized, so they have sd = 1. The predicted values can vary a bit around that distribution, especially if there are missing SNPs or the allele frequencies are very different from the reference.

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.23584 -0.17877 -0.11210 -0.11817 -0.08691  0.01951

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.02285 -0.01076  0.04197  0.02605  0.05024  0.11422

It might be related to how the input expression data is processed.
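To make the point about scale concrete, here is a minimal sketch, not the PredictDB pipeline itself. It assumes a rank-based inverse normal transform for the training expression and treats predicted expression as a weighted sum of SNP dosages. The SNP names, weights, and allele frequencies are hypothetical and the data are simulated. It illustrates that the training values have sd close to 1, while the spread of the predictions depends on the weights and shrinks when a SNP is missing.

```python
import random
from statistics import NormalDist, pstdev


def inverse_normal_transform(values):
    """Rank-based inverse normal transform: map ranks to standard normal quantiles."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    transformed = [0.0] * n
    standard_normal = NormalDist()
    for rank, i in enumerate(order, start=1):
        transformed[i] = standard_normal.inv_cdf((rank - 0.5) / n)
    return transformed


def predict(dosages, weights):
    """Weighted sum of dosages; SNPs absent from the genotype data are skipped."""
    return sum(w * dosages[snp] for snp, w in weights.items() if snp in dosages)


random.seed(0)
n = 2000

# Simulated skewed expression values, transformed as a training target would be.
raw_expression = [random.lognormvariate(0, 1) for _ in range(n)]
sd_training = pstdev(inverse_normal_transform(raw_expression))

# Hypothetical model weights and effect allele frequencies (illustration only).
weights = {"snp_a": 0.30, "snp_b": -0.20, "snp_c": 0.10}
freqs = {"snp_a": 0.3, "snp_b": 0.5, "snp_c": 0.2}

# Each dosage is the number of effect alleles (0, 1, or 2).
people = [
    {snp: sum(random.random() < f for _ in range(2)) for snp, f in freqs.items()}
    for _ in range(n)
]

sd_full = pstdev(predict(p, weights) for p in people)
sd_missing = pstdev(
    predict({s: d for s, d in p.items() if s != "snp_a"}, weights) for p in people
)

print(f"sd of training expression: {sd_training:.3f}")
print(f"sd of prediction, all SNPs present: {sd_full:.3f}")
print(f"sd of prediction, snp_a missing: {sd_missing:.3f}")
```

In this toy setting the predictions are much narrower than the training distribution, which matches the small ranges in the summaries above. Real models and genotype data will behave differently in detail.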

For my imputed expression datasets, there are around 6000-7000 genes. I guess there are usually about 18-19k protein-coding genes in human RNA-seq data, so I wonder if there could be any reason why I’m getting fewer genes.

Some genes were excluded according to the QC criteria (low expression, for example); see the GTEx 2020 Science paper for details. Others are not included because of low prediction performance (depending on the prediction approach).
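One way to check where genes drop out is to compare the genes in the model file with the genes in the imputed output. The sketch below assumes the model is an SQLite database with a `weights` table that has a `gene` column. That layout is an assumption here, so inspect your own file first. A small in-memory table with made-up gene IDs stands in for a real model.

```python
import sqlite3


def count_model_genes(conn):
    """Number of distinct genes that have at least one SNP weight."""
    return conn.execute("SELECT COUNT(DISTINCT gene) FROM weights").fetchone()[0]


def genes_without_prediction(conn, predicted_genes):
    """Genes present in the model but absent from the imputed expression output."""
    predicted = set(predicted_genes)
    rows = conn.execute("SELECT DISTINCT gene FROM weights").fetchall()
    return sorted(gene for (gene,) in rows if gene not in predicted)


# Toy stand-in for a model file; the schema and gene IDs are hypothetical.
# For a real file, something like sqlite3.connect("path/to/model.db") would be used.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weights (gene TEXT, rsid TEXT, weight REAL)")
conn.executemany(
    "INSERT INTO weights VALUES (?, ?, ?)",
    [
        ("GENE_1", "rs_a", 0.30),
        ("GENE_1", "rs_b", -0.20),
        ("GENE_2", "rs_c", 0.10),
        ("GENE_3", "rs_d", 0.05),
    ],
)

n_model_genes = count_model_genes(conn)
missing = genes_without_prediction(conn, ["GENE_1", "GENE_3"])
print(n_model_genes, missing)
```

If the model itself contains only a few thousand genes, the smaller count comes from the QC and prediction-performance filters described above rather than from the imputation step.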