Hi Juan,
The process is quite involved, and I give an overall description below. Let me know if you run into any issues.
Best,
Alvaro
PredictDBPipeline:
I would advise you to first look at the older pipeline's wiki (https://github.com/hakyimlab/PredictDBPipeline/wiki/Detailed_Description).
It contains a tutorial with sample data (https://github.com/hakyimlab/PredictDBPipeline/wiki/Tutorial).
It might also help to look at the GTEx eQTL pipeline (https://github.com/broadinstitute/gtex-pipeline).
The v7 model training pipeline differs from the old one mostly in the way the input data is processed, the file formats, and how the prediction performance is evaluated. The old documentation will give you an overall sense of the steps involved, and the tutorial will give you a feel for how the files should be formatted. The pipeline relies on imputing data to the variants available in 1000 Genomes; we used the Michigan Imputation Server for that. You might not need such a step; for example, GTEx v8 data is already imputed.
The scripts expect the data to be in a particular folder structure and data layout. Prior to running, the project folder should look like this:
|-- model_training
|   |-- analysis
|   |-- covariances
|   |-- dbs
|   |-- scripts
|   |-- summary
|   `-- weights
`-- prepare_data
    |-- covariates
    |-- expression
    |-- genotype
So, you would need to clone the code and create the missing folders. Several scripts are particular to the HPC cluster available to us, and have hardcoded paths that you would need to change or remove. You should search for the "/group/im-lab/nas40t2/scott/gtex_v7_imputed_europeans/model_training/scripts/" string and modify accordingly.
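As a rough sketch of that setup step (not part of the pipeline itself), the following Python creates the missing folders and lists every line that still contains the hardcoded cluster path. The function names create_layout and find_hardcoded_paths are made up for illustration; the folder names and the path string are the ones given above.

```python
from pathlib import Path

# Folder layout taken from the listing above.
LAYOUT = {
    "model_training": ["analysis", "covariances", "dbs", "scripts", "summary", "weights"],
    "prepare_data": ["covariates", "expression", "genotype"],
}

# Hardcoded cluster path mentioned above.
HARDCODED = "/group/im-lab/nas40t2/scott/gtex_v7_imputed_europeans/model_training/scripts/"


def create_layout(project_root):
    """Create any missing folders; existing ones are left untouched."""
    root = Path(project_root)
    for parent, children in LAYOUT.items():
        for child in children:
            (root / parent / child).mkdir(parents=True, exist_ok=True)


def find_hardcoded_paths(project_root, needle=HARDCODED):
    """Return (file, line_number) pairs for every line containing `needle`."""
    hits = []
    for path in sorted(Path(project_root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue  # unreadable file: skip it rather than abort the search
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(path), number))
    return hits
```

Running create_layout on the cloned project and then printing the result of find_hardcoded_paths gives you the list of places to edit by hand.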
There are a few file formats to be aware of.
Expression and covariate data files:
Expression and covariates are tab-separated text files with gzip-compression. They look like:
NAME IND1 IND2 IND3 ...
ENSG00000227232.5 -0.81 -0.29 0.31 ...
…
The first row is the header. For expression, the first column must be NAME. Each row contains the gene expression values for the individuals in your sample. For covariates, the first column is ID and each row contains a covariate such as ancestral PCs, PEER factors, etc. The pipeline doesn't allow missing values, and the individuals must be the same in the expression and covariate files.
They are expected to have the following name and path conventions:
prepare_data/expression/Tissue1_Analysis.expression.txt
prepare_data/covariates/Tissue1_Analysis.combined_covariates.txt
Expression data is inverse quantile normalized.
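A quick sanity check of these constraints before training can save a failed job. Below is a minimal sketch, assuming the files are gzip-compressed and tab-separated as described; the helper names and the set of strings treated as missing are my own assumptions, and the check compares the sets of individuals without requiring the same column order.

```python
import csv
import gzip

# Assumption: these spellings mark a missing value; adjust to your data.
MISSING = {"", "NA", "NaN", "nan"}


def read_matrix(path, first_column):
    """Read a gzip-compressed, tab-separated matrix.

    Returns (individuals, rows), where rows maps each row id to its values
    (kept as strings). Raises ValueError on a bad header, a ragged row,
    or a missing value.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        if header[0] != first_column:
            raise ValueError(f"{path}: first column is {header[0]!r}, expected {first_column!r}")
        rows = {}
        for record in reader:
            if not record:
                continue  # skip blank lines
            if len(record) != len(header):
                raise ValueError(f"{path}: row {record[0]!r} has {len(record)} fields, expected {len(header)}")
            if any(value.strip() in MISSING for value in record[1:]):
                raise ValueError(f"{path}: row {record[0]!r} contains a missing value")
            rows[record[0]] = record[1:]
    return header[1:], rows


def check_pair(expression_path, covariates_path):
    """Check that both files are well formed and cover the same individuals."""
    expression_individuals, _ = read_matrix(expression_path, "NAME")
    covariate_individuals, _ = read_matrix(covariates_path, "ID")
    if set(expression_individuals) != set(covariate_individuals):
        raise ValueError("expression and covariate files list different individuals")
    return expression_individuals
```

You would call check_pair once per tissue with the two paths listed above.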
Gene annotation:
You must provide a text file containing the list of genes you want to train on. It can contain fewer entries than are available in the expression data files. The file looks like this:
chr gene_id gene_name start end gene_type
1 ENSG00000243485.5 MIR1302-2HG 29554 31109 lincRNA
1 ENSG00000237613.2 FAM138A 34554 36081 lincRNA
1 ENSG00000186092.4 OR4F5 69091 70008 protein_coding
1 ENSG00000238009.6 RP11-34P13.7 89295 133723 lincRNA
...
It is expected at prepare_data/expression/gencode.v19.genes.patched_contigs.parsed.txt as we used GENCODE release 19. If you want to use a different one, you should search in the code for the file name or the path stem and change it accordingly.
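To see which genes would actually be trained, you can intersect the annotation with the gene ids in your expression file. This is only an illustrative sketch: parse_gene_annotation and genes_to_train are hypothetical helpers, and I am assuming the columns are whitespace- or tab-delimited with the header shown above.

```python
def parse_gene_annotation(lines):
    """Parse annotation lines into a list of dicts keyed by the header names."""
    rows = []
    header = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if header is None:
            header = fields  # chr, gene_id, gene_name, start, end, gene_type
            continue
        row = dict(zip(header, fields))
        row["start"] = int(row["start"])
        row["end"] = int(row["end"])
        rows.append(row)
    return rows


def genes_to_train(annotation_rows, expression_gene_ids):
    """Return annotated gene ids that also appear in the expression data."""
    available = set(expression_gene_ids)
    return [row["gene_id"] for row in annotation_rows if row["gene_id"] in available]


if __name__ == "__main__":
    sample = [
        "chr gene_id gene_name start end gene_type",
        "1 ENSG00000243485.5 MIR1302-2HG 29554 31109 lincRNA",
        "1 ENSG00000186092.4 OR4F5 69091 70008 protein_coding",
    ]
    annotation = parse_gene_annotation(sample)
    print(genes_to_train(annotation, ["ENSG00000186092.4", "ENSG00000227232.5"]))
```

Genes in the annotation with no expression row are simply dropped from the returned list, which makes an accidental id mismatch (for example, a different GENCODE version suffix) easy to spot.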
Genotype data:
Measured genotypes are text files with the following format:
varID IND1 IND2 IND3 ...
1_54421_A_G_b37 1 0 0 ...
...
So, these files are similar to the expression and covariate files. Dosages are expected to be in the [0,2] range. Missing values are not supported.
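The dosage constraints can be checked with a few lines of Python. The sketch below is not from the pipeline; check_dosages is a hypothetical helper that assumes whitespace- or tab-delimited columns and rejects anything that is not a number in the [0,2] range.

```python
def check_dosages(lines):
    """Validate genotype lines; return the number of variants checked.

    Raises ValueError on a ragged row, a non-numeric (missing) value,
    or a dosage outside the [0, 2] range.
    """
    header = None
    count = 0
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if header is None:
            header = fields  # varID followed by the individual ids
            continue
        if len(fields) != len(header):
            raise ValueError(f"line {line_number}: expected {len(header)} fields, found {len(fields)}")
        for individual, value in zip(header[1:], fields[1:]):
            try:
                dosage = float(value)
            except ValueError:
                raise ValueError(f"line {line_number}: {individual} has non-numeric value {value!r}") from None
            # NaN fails this comparison as well, so it is rejected here.
            if not 0.0 <= dosage <= 2.0:
                raise ValueError(f"line {line_number}: {individual} has dosage {value} outside [0, 2]")
        count += 1
    return count
```

For a real file you would pass an open file handle, since iterating over it yields lines.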
There must be companion variant annotation files. They are text files with the following format:
chromosome pos varID ref_vcf alt_vcf R2 MAF rsid rsid_dbSNP150
1 566875 1_566875_C_T_b37 C T 0.9747600000000001 0.03085 rs2185539 rs2185539
The variant ids need not be in the "{chr}_{pos}_{NEA}_{EA}_b37" format; the variant annotation file must provide the mapping from variant id to rsid.
Genotype and annotation files are expected to be split by chromosome.
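Since the annotation file is what ties each variant id to an rsid, it is worth confirming that every variant in a genotype file has an annotation entry. A minimal sketch under the same assumptions as before (whitespace- or tab-delimited columns, the header shown above); load_rsid_map and unmapped_variants are hypothetical helper names.

```python
def load_rsid_map(lines):
    """Build a dict from varID to rsid out of variant annotation lines."""
    header = None
    mapping = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if header is None:
            header = fields  # chromosome, pos, varID, ..., rsid, rsid_dbSNP150
            continue
        row = dict(zip(header, fields))
        mapping[row["varID"]] = row["rsid"]
    return mapping


def unmapped_variants(genotype_var_ids, mapping):
    """Return the genotype variant ids that have no annotation entry."""
    return [var_id for var_id in genotype_var_ids if var_id not in mapping]


if __name__ == "__main__":
    annotation = [
        "chromosome pos varID ref_vcf alt_vcf R2 MAF rsid rsid_dbSNP150",
        "1 566875 1_566875_C_T_b37 C T 0.9747600000000001 0.03085 rs2185539 rs2185539",
    ]
    rsids = load_rsid_map(annotation)
    print(unmapped_variants(["1_566875_C_T_b37", "1_54421_A_G_b37"], rsids))
```

Run it per chromosome, pairing each genotype file with its companion annotation file.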
Scripts:
submit_training_jobs.sh is merely a wrapper that will submit model training jobs, one per tissue-chromosome pair, to a PBS queue.
gtex_tiss_chr_elasticnet.pbs is a bash-like script describing the job, used by the above script.
gtex_tiss_chrom_training.R is an R script that maps the file names and folder structure to the actual model training script, used by the above script. You must modify this script to reflect the naming conventions of your data. For example, the existing pipeline assumes genotypes at:
prepare_data/genotype/gtex_v8_eur_shapeit2_phased_maf01_snp_annot.chr1.txt
In this script you can change that filename to, for example, prepare_data/genotype/my_data.chr1.txt
Here you can also change the expression and covariate file names.
gtex_v7_nested_cv_elnet.R is the main model training script, used by the above script.
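Because one job is submitted per tissue-chromosome pair, it can help to enumerate the pairs and confirm the input files exist before submitting anything. The sketch below is only an illustration and does not call the pipeline scripts. The file name patterns follow the conventions described earlier, "my_data" is the example genotype prefix from above, the default of chromosomes 1 to 22 is my assumption, and the helper names are hypothetical.

```python
from itertools import product
from pathlib import Path


def job_inputs(tissues, chromosomes=range(1, 23), genotype_prefix="my_data"):
    """List the input files for every tissue-chromosome pair."""
    jobs = []
    for tissue, chrom in product(tissues, chromosomes):
        jobs.append({
            "tissue": tissue,
            "chromosome": chrom,
            "expression": f"prepare_data/expression/{tissue}_Analysis.expression.txt",
            "covariates": f"prepare_data/covariates/{tissue}_Analysis.combined_covariates.txt",
            "genotype": f"prepare_data/genotype/{genotype_prefix}.chr{chrom}.txt",
        })
    return jobs


def missing_inputs(project_root, jobs):
    """Return the sorted relative paths of input files that do not exist."""
    root = Path(project_root)
    missing = set()
    for job in jobs:
        for key in ("expression", "covariates", "genotype"):
            if not (root / job[key]).is_file():
                missing.add(job[key])
    return sorted(missing)
```

If missing_inputs returns an empty list for your project folder, the naming in gtex_tiss_chrom_training.R and the files on disk at least agree on where the inputs are.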
Once you have run the model training, you need to compile its output into the actual prediction models (and covariance files for S-PrediXcan). First you must run the filter_dbs.R script and then the make_dbs.R script. To create covariances you must use the create_covariances.py script. All of these scripts assume a particular naming convention for the output and must be updated to match your own.
Final words:
This covered the core of the model training scripts.
The other scripts in the repository are tied to GTEx data particulars, such as parsing the genotype vcfs, doing inverse-quantile-normalization on expression data, etc. It is likely you will not need them.
--
You received this message because you are subscribed to the Google Groups "PrediXcan/MetaXcan" group.
To unsubscribe from this group and stop receiving emails from it, send an email to predixcanmetaxcan+unsubscribe@googlegroups.com.
To post to this group, send email to predixcanmetaxcan@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/predixcanmetaxcan/5ad74fcb-d208-4ed7-99bd-6472f2a65562%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
options(error = traceback)
to the top of the gtex_v7_nested_cv_elnet.R file; this should give a more detailed error profile.
Best,
Alvaro
Hi folks, thank you for this helpful discussion!
Is there an updated tutorial based on the scripts from https://github.com/hakyimlab/PredictDB_Pipeline_GTEx_v7 ?
There are no equivalent script names in the older pipeline's wiki (https://github.com/hakyimlab/PredictDBPipeline/wiki/Detailed_Description), so I am a bit confused as to where to start (to create my own weights for the new S-PrediXcan). Or should I just use the old scripts and run the old PrediXcan?