- In your study, you used GTEx data as the reference transcriptome to train prediction models (PredictDB) for multiple tissues. Each GTEx tissue has a different sample size, and I wonder whether the prediction models were affected by it. Would a larger sample size necessarily give better-quality predicted expression?
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7693040/pdf/GEPI-44-854.pdf
- Is there a way for me to train prediction models myself? I can find the trained models on PredictDB, but ideally I would like to train models on my own data.
- I imputed gene expression using some of your trained models for different tissues, and I wasn’t sure about the scale of the values. Are they normalized and scaled? Below are summaries for two example genes:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.23584 -0.17877 -0.11210 -0.11817 -0.08691  0.01951
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.02285 -0.01076  0.04197  0.02605  0.05024  0.11422
The scale might be related to how the input expression data were processed.
- My imputed expression datasets contain around 6,000-7,000 genes. As far as I know, human RNA-seq data usually cover about 18,000-19,000 protein-coding genes, so I wonder why I’m getting fewer genes.
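
For the last question, one check that might help is counting how many genes a downloaded model contains in the first place. The sketch below assumes the model is a SQLite file with a `weights` table that has a `gene` column; that layout is my assumption and should be checked against the actual file. The helper names and the toy in-memory database are hypothetical and only stand in for a real model file.

```python
import sqlite3


def count_model_genes(conn, table="weights", gene_column="gene"):
    """Count distinct genes that have at least one weight in the model.

    The table and column names are assumptions about the model file layout.
    """
    query = f"SELECT COUNT(DISTINCT {gene_column}) FROM {table}"
    return conn.execute(query).fetchone()[0]


def build_toy_model():
    """Build a tiny in-memory stand-in for a model file (made-up values)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE weights (gene TEXT, rsid TEXT, weight REAL)")
    conn.executemany(
        "INSERT INTO weights VALUES (?, ?, ?)",
        [
            ("GENE_A", "rs1", 0.12),
            ("GENE_A", "rs2", -0.05),
            ("GENE_B", "rs3", 0.30),
        ],
    )
    conn.commit()
    return conn


toy = build_toy_model()
print(count_model_genes(toy))
```

With a real model, `sqlite3.connect("path/to/model.db")` would replace the toy connection. If the count is already in the 6,000-7,000 range, the smaller gene set would come from the model itself rather than from my imputation step.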