First steps to run S-prediXcan in custom data


Juan Fernandez-Tajes

Jun 26, 2018, 7:48:07 AM
to PrediXcan/MetaXcan
Hi group,

I would like to apply PrediXcan to a set of three tissues that I have (actually a single tissue, sampled at three different locations). Could you please tell me how to generate my own databases to run with S-PrediXcan? I know the pipeline to create a database is at https://github.com/hakyimlab/PredictDB_Pipeline_GTEx_v7; however, could someone please tell me which scripts and which steps I should follow to create the input needed for running PrediXcan?

Thanks
Juan

Alvaro Barbeira

Jun 26, 2018, 9:57:34 AM
to Juan Fernandez-Tajes, PrediXcan/MetaXcan

Hi Juan,


The process is quite involved, and I give an overall description below. Let me know if you run into any issues.


Best,


Alvaro




PredictDBPipeline:


I would advise first looking at the older pipeline's wiki (https://github.com/hakyimlab/PredictDBPipeline/wiki/Detailed_Description).


It contains a tutorial with sample data (https://github.com/hakyimlab/PredictDBPipeline/wiki/Tutorial).


It might also help to look at the GTEx eQTL pipeline (https://github.com/broadinstitute/gtex-pipeline).


The v7 model training pipeline differs from the old one mostly in the way the input data is processed, the file formats, and how the prediction performance is evaluated. The old documentation will give you an overall sense of the steps involved, and the tutorial a feel for how the files should be formatted. The pipeline relies on imputing data to the variants available in 1000 Genomes; we used the Michigan Imputation Server for that. You might not need such a step; for example, GTEx v8 data is already imputed.


The scripts expect the data to be in a particular folder structure and data layout. Prior to running, the project folder should look like this:


|-- model_training
|   |-- analysis
|   |-- covariances
|   |-- dbs
|   |-- scripts
|   |-- summary
|   `-- weights
`-- prepare_data
    |-- covariates
    |-- expression
    `-- genotype


So, you would need to clone the code and create the missing folders. Several scripts are particular to the HPC cluster available to us, and have hardcoded paths that you would need to change or remove. You should search for the "/group/im-lab/nas40t2/scott/gtex_v7_imputed_europeans/model_training/scripts/" string and modify accordingly.
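Creating the missing folders can be scripted in one go; a minimal sketch (the project root name predictdb_project is my own placeholder):

```python
# Sketch: create the project layout expected by the pipeline.
# Folder names are taken from the listing above; the root is a placeholder.
from pathlib import Path

root = Path("predictdb_project")  # hypothetical project root
subdirs = [
    "model_training/analysis", "model_training/covariances",
    "model_training/dbs", "model_training/scripts",
    "model_training/summary", "model_training/weights",
    "prepare_data/covariates", "prepare_data/expression",
    "prepare_data/genotype",
]
for d in subdirs:
    (root / d).mkdir(parents=True, exist_ok=True)
```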


There are a few file formats to be aware of.



Expression and covariate data files:

Expression and covariates are tab-separated text files with gzip-compression. They look like:


NAME              IND1  IND2  IND3 ...
ENSG00000227232.5 -0.81 -0.29 0.31 ...


The first row is the header. For expression, the first column must be NAME; each row contains one gene's expression values for the individuals in your sample. For covariates, the first column is ID and each row contains a covariate such as ancestral PCs, PEER factors, etc. The pipeline doesn't allow missing values, and the individuals must be the same in the expression and covariate files.

They are expected to have the following name and path conventions:

prepare_data/expression/Tissue1_Analysis.expression.txt

prepare_data/covariates/Tissue1_Analysis.combined_covariates.txt


Expression data is inverse quantile normalized.
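For reference, inverse quantile normalization rank-transforms each gene's values to standard-normal quantiles. A minimal pure-Python sketch of one common formulation (ranks divided by n+1; the exact variant the pipeline uses may differ):

```python
# Sketch of rank-based inverse quantile normalization (assumed formulation).
from statistics import NormalDist

def inverse_quantile_normalize(values):
    """Map values to standard-normal quantiles by rank (average ranks for ties)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    nd = NormalDist()
    return [nd.inv_cdf(r / (n + 1)) for r in ranks]

expr = [3.1, 0.2, 5.7, 1.4]  # made-up expression values for one gene
z = inverse_quantile_normalize(expr)
```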


Gene annotation:

You must provide a text file containing the list of genes you want to train on. It can contain fewer entries than the expression data files. The file looks like this:


chr gene_id           gene_name    start end    gene_type
1   ENSG00000243485.5 MIR1302-2HG  29554 31109  lincRNA
1   ENSG00000237613.2 FAM138A      34554 36081  lincRNA
1   ENSG00000186092.4 OR4F5        69091 70008  protein_coding
1   ENSG00000238009.6 RP11-34P13.7 89295 133723 lincRNA
...


It is expected at prepare_data/expression/gencode.v19.genes.patched_contigs.parsed.txt, as we used GENCODE release 19. If you want to use a different release, you should search the code for the file name or the path stem and change it accordingly.


Genotype data:

Measured genotypes are text files with the following format:


varID           IND1 IND2 IND3 ...
1_54421_A_G_b37 1    0    0    ...
...


So, these files are similar to the expression and covariate files. Dosages are expected to be in the [0,2] range. Missing values are not supported.


There must be companion variant annotation files. They are text files with the following format:

chromosome pos    varID            ref_vcf alt_vcf R2                 MAF     rsid      rsid_dbSNP150
1          566875 1_566875_C_T_b37 C       T       0.9747600000000001 0.03085 rs2185539 rs2185539


The variant ids need not be in the "{chr}_{pos}_{NEA}_{EA}_b37" format; the variant annotation file must provide the mapping from variant id to rsid.


Genotype and annotation files are expected to be split by chromosome.
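Before training, it may help to sanity-check the genotype and annotation files against these expectations: dosages within [0,2], no missing values, unique variant ids, and every id mapped in the annotation file. A sketch over made-up in-memory files (the function name and example contents are my own):

```python
# Sketch: validate genotype rows against the companion annotation file.
import io

genotype_txt = """varID IND1 IND2 IND3
1_54421_A_G_b37 1 0 0
1_566875_C_T_b37 2 1 0
"""
annot_txt = """chromosome pos varID ref_vcf alt_vcf R2 MAF rsid rsid_dbSNP150
1 54421 1_54421_A_G_b37 A G NA 0.2 rs1 rs1
1 566875 1_566875_C_T_b37 C T NA 0.03 rs2185539 rs2185539
"""

def check_genotypes(geno_file, annot_file):
    header = annot_file.readline().split()
    id_col = header.index("varID")
    annot_ids = {line.split()[id_col] for line in annot_file}
    seen = set()
    geno_file.readline()  # skip the genotype header row
    for line in geno_file:
        fields = line.split()
        var_id, dosages = fields[0], fields[1:]
        assert var_id not in seen, f"duplicate id: {var_id}"
        assert var_id in annot_ids, f"unmapped id: {var_id}"
        for d in dosages:
            v = float(d)  # raises on missing/non-numeric values
            assert 0.0 <= v <= 2.0, f"dosage out of range for {var_id}"
        seen.add(var_id)
    return len(seen)

n = check_genotypes(io.StringIO(genotype_txt), io.StringIO(annot_txt))
```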


Scripts:


submit_training_jobs.sh is merely a wrapper that will submit model training jobs, one per tissue-chromosome pair, to a PBS queue.


gtex_tiss_chr_elasticnet.pbs is a bash-like script describing the job, used by the above script.


gtex_tiss_chrom_training.R is an R script that maps the file names and folder structure to the actual model training script, used by the above script. You must modify this script to reflect the naming conventions of your data. For example, the existing pipeline assumes genotypes at:


prepare_data/genotype/gtex_v8_eur_shapeit2_phased_maf01_snp_annot.chr1.txt


In this script you can change the filename to, for example, prepare_data/genotype/my_data.chr1.txt.

Here you can also change the expression and covariate file names.


gtex_v7_nested_cv_elnet.R is the main model training script, used by the above script.


Once you run the model training, you need to compile its output into the actual prediction models (and covariance files for S-PrediXcan). First run the filter_dbs.R script and then make_dbs.R. To create covariances, use the create_covariances.py script. All of these scripts assume a particular naming convention for the output and must be updated to match yours.



Final words:

This covered the core of the model training scripts.


The other scripts in the repository are tied to GTEx data particulars, such as parsing the genotype vcfs, doing inverse-quantile-normalization on expression data, etc. It is likely you will not need them.










--
You received this message because you are subscribed to the Google Groups "PrediXcan/MetaXcan" group.
To unsubscribe from this group and stop receiving emails from it, send an email to predixcanmetaxcan+unsubscribe@googlegroups.com.
To post to this group, send email to predixcanmetaxcan@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/predixcanmetaxcan/5ad74fcb-d208-4ed7-99bd-6472f2a65562%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Juan Fernandez-Tajes

Jun 28, 2018, 4:34:45 AM
to PrediXcan/MetaXcan
Thanks a lot, Alvaro. One more question:

I have imputed variants using HRC (~8,000,000 variants); which variants do you recommend using? Those with an rsid in dbSNP? A certain number? Those overlapping with 1000 Genomes or HapMap? Or just feed in all of them?

Thanks a lot
Juan

Alvaro Barbeira

Jun 28, 2018, 9:19:55 AM
to Juan Fernandez-Tajes, PrediXcan/MetaXcan
Hi Juan,

It depends a bit on your scenario.

For example, we publish models based on HapMap CEU SNPs with MAF > 0.01 in the sample set, to achieve robustness when running with a broad set of GWAS, possibly based on different human genome release versions. This is not the best choice for every GWAS out there; if a particular GWAS is imputed to a recent release of 1000 Genomes, using those SNPs might work better.

If your use case is centered on GWAS imputed to a particular panel, I would use that to decide. I.e., if you plan to use GWAS imputed to HRC, you might want to consider training with all variants in HRC with MAF > 0.01 (depending on your transcriptome sample size). Having an rsid in dbSNP is probably a safe criterion. If you have custom ids, you can just use those as the rsid in the models and PrediXcan tools; even though it's called rsid in the models and code, there is no restriction on the values.
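As a side note, the MAF that matters here is computed directly from the dosages (allele frequency = mean dosage / 2), since the training code derives MAF from the genotypes rather than from the annotation file. A sketch with made-up dosage vectors and the 0.01 threshold mentioned above (function names are my own; the pipeline's exact implementation may differ):

```python
# Sketch: MAF filter computed from dosages, as an assumed formulation.
def maf_from_dosages(dosages):
    """Minor allele frequency from dosages in [0, 2]."""
    freq = sum(dosages) / (2 * len(dosages))  # alt-allele frequency
    return min(freq, 1 - freq)

def passes_maf_filter(dosages, threshold=0.01):
    return maf_from_dosages(dosages) > threshold

common = [1, 0, 2, 1, 0, 1]  # alt frequency 5/12, passes
rare = [0, 0, 0, 0, 0, 0]    # monomorphic, MAF = 0, filtered out
```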

Best,

Alvaro




Otto Bergman

Jul 17, 2018, 9:28:23 AM
to PrediXcan/MetaXcan
hi Alvaro,

Thank you for this thorough guide. I still have some questions about building models from my own data.
The scripts you mentioned, e.g. submit_training_jobs.sh and gtex_tiss_chr_elasticnet.pbs, cannot be found anywhere in the cloned folders from hakyimlab/PredictDBPipeline. Where are these scripts?
From the script names, it looks like the model building is based on GTEx data? I would like to build a model based on my own expression and genotype data; is this possible?
Thank you for your input.
Looking forward to learning more.

/Otto

Alvaro Barbeira

Jul 17, 2018, 9:39:24 AM
to Otto Bergman, PrediXcan/MetaXcan
Hi Otto,


hakyimlab/PredictDBPipeline was assembled for GTEx release v6p and is considered deprecated, but the documentation in its wiki might be a helpful start.

Users have employed both pipelines to train models, although I expect modifications are needed for each specific data set.

Best,

Alvaro


John Rouhana

Jul 25, 2019, 1:25:16 PM
to PrediXcan/MetaXcan
Hi Alvaro,

What is the R2 I see listed in the variant annotation format you provide as an example?

-John

Alvaro Barbeira

Jul 25, 2019, 3:10:08 PM
to John Rouhana, PrediXcan/MetaXcan
Hi John,

R2 is typically a measure of imputation quality. You can safely set it to NA for the model training pipeline's purposes.

Best,

Alvaro


David Davtian

May 3, 2020, 3:16:37 PM
to PrediXcan/MetaXcan
Hi Alvaro,

Just a quick question. You wrote in your description here that the companion variant annotation file is supposed to include a MAF column, and that the measured genotype data is just variant ids and dosages for the individuals.
The problem I have is that within the gtex_v7_nested_cv_elnet.R script, on line 20 if I'm not wrong, a function called get_maf_filtered_genotype uses the genotype_file to get the MAF values. Yet, since MAF is present only in the snp_annot_file, I think this is triggering an error during my execution of the code.
Since this hasn't been noticed yet, I don't know if I am just talking nonsense or if it's a real mistake...

Anyway, thank you in advance for your time,
Best regards,

David

Alvaro Barbeira

May 4, 2020, 9:16:37 AM
to David Davtian, PrediXcan/MetaXcan
Hi David,

The code computes MAF from the genotypes by design; the MAF in the snp annotation file is ignored.
The motivation is that this code used a GTEx snp annotation file whose MAF was computed from all individuals, but the genotypes were only from individuals of European ancestry, so the effective MAF is different.

If you are having an error, I recommend adding something like:
options(error = traceback)
to the top of the gtex_v7_nested_cv_elnet.R file; this should give a more detailed error profile.

Best,

Alvaro





David Davtian

May 4, 2020, 10:03:08 AM
to PrediXcan/MetaXcan
Hi Alvaro,

Thanks for the reply! I used the traceback and I am getting an R read.table error (detail in the attached image).
But I guess it is due to the structure or names in my input data.

Anyway, thank you again!

Best,
David
[attachment: error.PNG]

Alvaro Barbeira

May 4, 2020, 10:14:48 AM
to David Davtian, PrediXcan/MetaXcan
Hi David,

Thanks for getting back to me. Indeed, the error looks like multiple lines have the same id; the code assumes a single, distinct variant per row.
In the future, I recommend copy-pasting the actual text from the console output into the email. Copying text from a console is supported by (almost) every terminal/ssh client.
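A quick way to find which variant ids are duplicated before handing a genotype file to the training code; a sketch over made-up rows (the helper name is my own):

```python
# Sketch: report variant ids that appear on more than one genotype row.
from collections import Counter

def duplicated_ids(lines):
    """Return variant ids (first column) that appear on more than one row."""
    counts = Counter(line.split()[0] for line in lines if line.strip())
    return sorted(vid for vid, n in counts.items() if n > 1)

rows = [
    "1_54421_A_G_b37 1 0 0",
    "1_566875_C_T_b37 2 1 0",
    "1_54421_A_G_b37 1 1 0",  # duplicate on purpose
]
dups = duplicated_ids(rows)
```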

Best,

Alvaro


David Davtian

May 4, 2020, 10:29:15 AM
to PrediXcan/MetaXcan
Hi,

So I guess this might be because my genotype files are vcf.gz and they still have the header from the original vcf file; I'll try to deal with it, thank you!
And concerning the code: since I am on a virtual machine, it's quite tricky to copy-paste from there to my local machine...

Thank you again very much,

Best,
David

Alvaro Barbeira

May 4, 2020, 11:27:51 AM
to David Davtian, PrediXcan/MetaXcan
Hi David,

The code's genotype format is rigid. It must be a text file (possibly gzip-compressed) with the following format:

varID           IND1 IND2 IND3 ...
1_54421_A_G_b37 1    0    0    ...
...

(The variant id can be any string as long as it's unique and properly mapped in the snp annotation file.) The code can't handle anything else, such as a vcf header at the beginning of the file.
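As an illustration only, here is a pure-Python sketch that strips a VCF down to that layout by converting GT fields to alt-allele dosages. It assumes biallelic sites and no missing genotypes; the b37 suffix and function name are my own, and for real data a tool like bcftools is the better choice:

```python
# Sketch: convert minimal VCF lines to the 'varID d1 d2 ...' dosage layout.
def vcf_to_dosage_lines(vcf_lines, build="b37"):
    """Turn VCF text lines into genotype rows (GT field only, biallelic)."""
    out = []
    for line in vcf_lines:
        if line.startswith("##"):
            continue  # skip meta-information header lines
        if line.startswith("#CHROM"):
            samples = line.rstrip().split("\t")[9:]
            out.append("varID " + " ".join(samples))
            continue
        f = line.rstrip().split("\t")
        chrom, pos, ref, alt = f[0], f[1], f[3], f[4]
        var_id = f"{chrom}_{pos}_{ref}_{alt}_{build}"
        dosages = []
        for sample in f[9:]:
            gt = sample.split(":")[0].replace("|", "/")
            dosages.append(str(sum(int(a) for a in gt.split("/"))))
        out.append(var_id + " " + " ".join(dosages))
    return out

vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tIND1\tIND2",
    "1\t54421\trs1\tA\tG\t.\tPASS\t.\tGT\t0|1\t0|0",
]
rows = vcf_to_dosage_lines(vcf)
```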

Which virtual machine software are you using? In my experience, either copy-pasting worked, or I would just log in via ssh from the host OS to the virtual machine and work in the host's terminal (or PuTTY, the one time I was working on a Windows host with a local Linux VM).

Best,

Alvaro


David Davtian

May 4, 2020, 11:35:32 AM
to PrediXcan/MetaXcan
Hi Alvaro,

That's what I would guess, and that's what I am trying to do right now: cleaning the vcf.gz files to get rid of the header and any unused columns.
I hope it will work.
Concerning the VM, I am using Citrix. Copy-pasting from my local machine to the VM works, but the other way around does not; I don't know why, but I can look into it. It's the least of my problems right now... haha

Best,

David

Alvaro Barbeira

May 4, 2020, 12:00:41 PM
to David Davtian, PrediXcan/MetaXcan
Hi David,

Just in case, I have an example script to extract (phased) genotypes from vcfs here, using bcftools. I guess it should be straightforward to adapt to your case.

Best,

Alvaro


Alvaro Barbeira

Apr 7, 2021, 5:12:19 PM
to Rodrigo Duarte, PrediXcan/MetaXcan
Hi Rodrigo, please see my answers below.

Best,

Alvaro

On Wed, Apr 7, 2021 at 4:37 PM Rodrigo Duarte <rodrigo....@gmail.com> wrote:
Hi folks, thank you for this helpful discussion!

Is there an updated tutorial based on the scripts from https://github.com/hakyimlab/PredictDB_Pipeline_GTEx_v7 ?
Yes and no.
Festus did a workflowr project with a reproducible research approach to showcase Elastic Net model training.
It is extra helpful for learning the internals, and you can explore the data and intermediate steps.

There are no equivalent script names based on the older pipeline's wiki (https://github.com/hakyimlab/PredictDBPipeline/wiki/Detailed_Description), so I am a bit confused as to where to start (to create my own weights for the new S-PrediXcan). Or should I just use the old scripts and run the old PrediXcan?
Right now, an old message explains all the steps that need to be executed, and the related file formats, for GTEx v7-style PredictDB models:
https://groups.google.com/g/predixcanmetaxcan/c/TkBxYkUpNGw/m/Q_mMApRtCQAJ
 


