Re: Request for Assistance with S-PrediXcan on UK Biobank WGS Data

Hae Kyung Im

Jun 18, 2025, 3:23:05 PM

to Faisal Imran, Festus Nyasimi, predixca...@googlegroups.com

Festus,

could you help with this?

Thanks

Haky

On Tue, Jun 10, 2025 at 9:50 AM Faisal Imran <fimran.ms...@seecs.edu.pk> wrote:

Thank you, Hae, for your prompt reply. The file we used was downloaded from an open source (https://www.ebi.ac.uk/gwas/downloads/summary-statistics), since the original UK Biobank data does not include any precomputed GWAS summary statistics. We are targeting the liver-based models and GWAS summary data. We also tried running with the provided example model, but it does not give any result. We have attached the log from a run using the GWAS file downloaded from the source mentioned above and the Liver model from https://zenodo.org/records/3518299 .
Another important question: how can we prepare UK Biobank data for S-PrediXcan?

Any help in this regard would be highly appreciated.

Here are the details that you have asked for.

head:

chromosome variant_id base_pair_location effect_allele other_allele effect_allele_frequency beta standard_error p_value
1 rs146836579 87647 C T 0.00215931 0.0429659 0.327721 0.895692
1 rs7545609 90051 T C 0.00216346 0.0435438 0.327723 0.894298
1 rs546872994 136113 T C 0.00121655 -0.700075 0.431632 0.104819
1 NA 267404 T TATA 0.00414028 -0.174513 0.237157 0.46182
1 rs554909596 458823 TA T 0.0026455 -0.0147158 0.290358 0.959579
1 rs28863004 526736 G C 0.00291971 -0.683316 0.284223 0.01621
1 rs557203750 559985 T C 0.00548926 -0.00460037 0.20235 0.981862
1 rs564040090 562147 A T 0.0125908 -0.0630268 0.136551 0.644396
1 rs569899510 563812 T G 0.00240385 0.114382 0.305065 0.707704

command:

cd /opt/notebooks/my_work/MetaXcan/software
./SPrediXcan.py \
  --model_db_path /opt/notebooks/my_work/eqtl/mashr/mashr_Liver.db \
  --covariance /opt/notebooks/my_work/eqtl/mashr/mashr_Liver.txt.gz \
  --gwas_file /opt/notebooks/ukb_data/GCST90103908_processed.tsv \
  --snp_column SNP \
  --effect_allele_column effect_allele \
  --non_effect_allele_column non_effect_allele \
  --beta_column beta \
  --pvalue_column pvalue \
  --keep_non_rsid \
  --model_db_snp_key varID \
  --output_file /opt/notebooks/my_work/results/linoleic_acid_spredixcan.csv

output:

WARNING - Missing --gwas_h2 and --gwas_N are required to calibrate the pvalue and zscore.
INFO - Processing GWAS command line parameters
INFO - Building beta for /opt/notebooks/ukb_data/GCST90103908_processed.tsv and /opt/notebooks/my_work/eqtl/mashr/mashr_Liver.db
INFO - Reading input gwas with special handling: /opt/notebooks/ukb_data/GCST90103908_processed.tsv
INFO - Processing input gwas
INFO - Aligning GWAS to models
/opt/notebooks/my_work/MetaXcan/software/metax/misc/GWASAndModels.py:15: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
  alleles_1 = pandas.Series([set(e) for e in zip(merged[EA], merged[NEA])])
/opt/notebooks/my_work/MetaXcan/software/metax/misc/GWASAndModels.py:16: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
  alleles_2 = pandas.Series([set(e) for e in zip(merged[EA_BASE], merged[NEA_BASE])])
INFO - Trimming output
INFO - Successfully parsed input gwas in 3.0287915779990726 seconds
INFO - Started metaxcan process
INFO - Loading model from: /opt/notebooks/my_work/eqtl/mashr/mashr_Liver.db
INFO - Loading covariance data from: /opt/notebooks/my_work/eqtl/mashr/mashr_Liver.txt.gz
INFO - Processing loaded gwas
INFO - Started metaxcan association
INFO - 0 % of model's snps used
WARNING - IMPORTANT: The pvalue and zscore are uncalibrated for inflation
INFO - Sucessfully processed metaxcan association in 2.3774969240002974 seconds

Best,
Faisal

On Thu, Jun 5, 2025 at 7:39 PM Hae Kyung Im <ha...@uchicago.edu> wrote:
Hi Faisal,
Those problems have to do with a mismatch between the SNP names in the model and those in the genotype files. PrediXcan, and TWAS in general, are used with common variants, so there is no benefit in using WGS over the imputed genotype data in the UK Biobank. Please send the head of your genotype files and the exact command you are using to PrediXcan/MetaXcan <predixca...@googlegroups.com>.
Haky

On Thu, Jun 5, 2025 at 2:31 AM Faisal Imran <fimran.ms...@seecs.edu.pk> wrote:
Dear Haky,

I hope you are doing well. We are collaborating with Theranostics Laboratory on generating liver transcriptomes using UK Biobank data and have been attempting to run S-PrediXcan for this project. Although the example scripts in the repository execute successfully, we encounter errors when applying the provided model to the UKBB whole-genome sequencing (WGS) data. Additionally, when we use the GTEx V8 models from PredictDB, we observe 0 % SNP coverage.

We would greatly appreciate any guidance you can offer to help us integrate S-PrediXcan with UK Biobank WGS. We have also explored WGS data from sources such as EBI but have not been successful.

Thank you for your time and assistance.

Best,
Faisal

Festus

Jun 20, 2025, 10:11:29 AM

to PrediXcan/MetaXcan

Hi Faisal,

I see a few issues with your command:

In the command-line argument --snp_column you are passing SNP, but your summary statistics have variant_id. The value should be the name of the column holding the identifier you want to match against the ID used in the database. So if you meant to use the column with rsIDs, it should be --snp_column variant_id.
The mashr models don't have rsIDs, so you can't directly use the rsIDs in your summary statistics to map to the model weights. You will need to generate another column in the summary statistics holding a variant ID in the format chr_pos_ref_alt_b38 (b38 is the genome build for SNPs in the mashr models). The --snp_column argument should reference this column.
You will need to use the --keep_non_rsid flag in your command because, again, these models don't have rsIDs.
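
As a rough illustration of the second point, here is a minimal Python sketch of how such an ID column might be built from the columns shown in the posted head. Several things here are assumptions rather than anything confirmed in this thread: that other_allele is the reference allele and effect_allele is the alternate allele, that base_pair_location is already on build 38, and that the chromosome part carries a "chr" prefix. The column name var_id and the helper names are hypothetical. It is worth comparing a few generated IDs against the varID values in the model database before a full run.

```python
import csv
import io


def make_var_id(chrom, pos, ref, alt, build="b38"):
    """Return an ID shaped like chr1_90051_C_T_b38."""
    chrom = str(chrom)
    # ASSUMPTION: the model IDs use a "chr" prefix; add it if missing.
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom
    return "_".join([chrom, str(pos), ref, alt, build])


def add_var_id(rows, id_column="var_id"):
    """Yield a copy of each row with an extra ID column.

    ASSUMPTION: other_allele is the reference allele and effect_allele is
    the alternate allele, and positions are already on build 38.
    Verify both for your own file.
    """
    for row in rows:
        out = dict(row)
        out[id_column] = make_var_id(
            row["chromosome"],
            row["base_pair_location"],
            row["other_allele"],
            row["effect_allele"],
        )
        yield out


def missing_columns(header, requested):
    """Return the requested column names that are absent from the header."""
    return [name for name in requested if name not in header]


# Two rows from the posted head, tab-separated, trimmed to the needed columns.
sample = (
    "chromosome\tvariant_id\tbase_pair_location\teffect_allele\tother_allele\n"
    "1\trs7545609\t90051\tT\tC\n"
    "1\tNA\t267404\tT\tTATA\n"
)
reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
header = reader.fieldnames
rows = list(add_var_id(reader))

for row in rows:
    print(row["variant_id"], row["var_id"])

# "SNP" was passed to --snp_column but is not a column in the posted head.
print(missing_columns(header, ["SNP", "variant_id"]))
```

With a column like this written back to the summary statistics file, the command would pass --snp_column var_id together with --keep_non_rsid and --model_db_snp_key varID. The small missing_columns check is only a convenience for catching a column name on the command line that does not exist in the file header.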