Fwd: Questions related to PrediXcan


Hae Kyung Im

May 13, 2024, 4:13:32 PM
to PrediXcan/MetaXcan
Hi,
Please see below for responses.
Haky

 


 

  1. Using GTEx data as the reference transcriptome, you trained prediction models (PredictDB) for multiple tissues in your study. Each tissue in GTEx has a different sample size, and I wonder whether the prediction models were affected by the sample size. Would a larger sample size necessarily give better-quality predicted expression?

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7693040/pdf/GEPI-44-854.pdf

 


Prediction performance will increase with sample size. The number of genes that you can predict reliably will also increase with sample size.

Here is the number of gene expression models by tissue and sample size for the fine-mapping followed by mashr effect smoothing approach: https://lab-notes.hakyimlab.org/post/2023-03-01-gtex-sample-size-by-tissue/index.html
 

  2. Is there a way for me to train prediction models myself? I can find the trained models on PredictDB, but ideally I would like to train models using the data I have.

 

 

  3. I imputed gene expression using some of your trained models for different tissues. I wasn’t sure about the scale of the values. Are they normalized and scaled? The following are examples for several genes:
The expression values used for training are inverse normalized, so they have sd = 1. But the predictions can vary a bit around that distribution, especially if there are missing SNPs or the allele frequencies are very different from the reference.

 

 

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.23584 -0.17877 -0.11210 -0.11817 -0.08691  0.01951

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.02285 -0.01076  0.04197  0.02605  0.05024  0.11422

 

It might be related to how the input expression data is processed.
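
To make the scale point in the reply above concrete, here is a minimal sketch, not the PredictDB pipeline itself. It assumes a rank-based inverse normal transform with a 0.5 offset (the exact variant used for training is not stated in this thread) and a predicted value computed as a weighted sum of SNP dosages. The SNP names, weights, and allele frequency are hypothetical toy values, and treating a missing SNP as contributing nothing is also an assumption.

```python
import random
from statistics import NormalDist, mean, pstdev


def inverse_normal_transform(values, offset=0.5):
    """Rank-based inverse normal transform (assumed variant; ties not treated specially)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()
    out = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        # Map each rank to the matching standard normal quantile.
        out[idx] = std_normal.inv_cdf((rank - offset) / n)
    return out


def predict_expression(weights, dosages):
    """Weighted sum of dosages; SNPs absent from `dosages` contribute nothing (assumption)."""
    return sum(w * dosages[snp] for snp, w in weights.items() if snp in dosages)


random.seed(0)

# Skewed toy "expression" values become roughly standard normal after the transform.
raw = [random.expovariate(1.0) for _ in range(500)]
transformed = inverse_normal_transform(raw)

# Hypothetical model weights and simulated genotypes (toy values, not from any real model).
weights = {"snp1": 0.15, "snp2": -0.10, "snp3": 0.20}
maf = 0.3


def draw_dosage():
    # Dosage = number of alternate alleles out of two, each drawn with probability maf.
    return sum(random.random() < maf for _ in range(2))


people = [{snp: draw_dosage() for snp in weights} for _ in range(500)]
full = [predict_expression(weights, p) for p in people]
# Same individuals, but snp3 is missing from the target genotypes.
partial = [
    predict_expression(weights, {s: d for s, d in p.items() if s != "snp3"})
    for p in people
]

sd_full = pstdev(full)
sd_missing = pstdev(partial)
print("sd of transformed training values:", round(pstdev(transformed), 3))
print("sd of predictions, all SNPs:", round(sd_full, 3))
print("sd of predictions, snp3 missing:", round(sd_missing, 3))
```

In this toy setting the transformed training values have an sd close to 1, while the spread of the predictions is set by the weights and by the genotypes in the target data, so it shifts when a SNP is missing.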

 

  4. My imputed expression datasets contain around 6000-7000 genes. I believe there are usually about 18-19k protein-coding genes in human RNA-seq data, so I wonder why I’m getting fewer genes.

 


Some genes were excluded according to the QC criteria (low expression, for example); see the GTEx 2020 Science paper for details. Others are not included because of low prediction performance (depending on the prediction approach).
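
One way to check how many genes a given model can predict is to count them in the model file directly. The sketch below assumes the model is a SQLite database with an `extra` table that has a `gene` column and a `pred.perf.R2` column; that layout is an assumption here, so inspect your own file's schema first. The gene IDs, R2 values, and the 0.01 cutoff are hypothetical, and an in-memory database stands in for a real model file.

```python
import sqlite3


def count_model_genes(conn, min_r2=None):
    """Count distinct genes in the assumed `extra` table, optionally above an R2 cutoff."""
    if min_r2 is None:
        query, params = "SELECT COUNT(DISTINCT gene) FROM extra", ()
    else:
        # The column name contains dots, so it must be quoted in SQL.
        query = 'SELECT COUNT(DISTINCT gene) FROM extra WHERE "pred.perf.R2" >= ?'
        params = (min_r2,)
    return conn.execute(query, params).fetchone()[0]


# Toy stand-in for a model file; with a real file you would call sqlite3.connect(path).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE extra (gene TEXT, genename TEXT, "pred.perf.R2" REAL)')
conn.executemany(
    "INSERT INTO extra VALUES (?, ?, ?)",
    [
        ("GENE_A", "toyA", 0.25),   # hypothetical gene IDs and performance values
        ("GENE_B", "toyB", 0.02),
        ("GENE_C", "toyC", 0.11),
        ("GENE_D", "toyD", 0.004),
    ],
)

n_all = count_model_genes(conn)
n_kept = count_model_genes(conn, min_r2=0.01)  # 0.01 is an arbitrary illustrative cutoff
print("genes in model:", n_all, "| genes with R2 >= 0.01:", n_kept)
```

Comparing this count with the number of columns in the imputed expression output shows whether the gap comes from the model itself or from the prediction step.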
 