0% of model SNPs

64 views
Skip to first unread message

Zurodher

unread,
Mar 25, 2025, 9:06:26 AMMar 25
to PrediXcan/MetaXcan
Hi all,
I've been struggling with the "0% of model SNPs used". 
I am using a GWAS summstats (build 38) which looks like:
#chrom  pos     ref     alt     rsids   nearest_genes   pval    mlogp   beta    sebeta  af_alt  af_alt                                 _cases  af_alt_controls varID
1       13668   G       A       rs2691328       DDX11L1 0.635823        0.196664        0.1992  0.4206                                 53      0.0059906       0.00634106      0.00598931      chr1_13668_G_A_b38
1       14506   G       A       rs1240557819    WASH7P  0.298196        0.525498        -0.469636    0     

I created the "varID" column to match with the Protein ARIC AA weights, which looks like:
 gene       rsid                  varID ref_allele eff_allele
1 ENSG00000254521  rs3752135 chr19_51497370_T_G_b38          T          G
2 ENSG00000254521  rs3826667 chr19_51500820_C_T_b38          C          T
3 ENSG00000254521  rs3810114 chr19_51503097_T_C_b38          T          C

46,56 % of the SNPs  (33009 SNPs) in the GWAS sumstats match with the ARIC weights (NSNPs= 70881) (for both rsid and varID). However, when trying to match by rsid I got ""0% of model SNPs used" but by matching by varID: "38% of model's snps used". Why I did not get the same (i.e., 38% of model's snps used) when using rsid?

See below the command lines:
- For rsid:
python3 /.../SPrediXcan.py \
--model_db_path /.../ARIC/ARIC_AA_hg38.db \
--covariance /.../ARIC/ARIC_AA_hg38.txt.gz \
--gwas_folder /.../sumstats_PrediXcan/ \
--gwas_file_pattern "sumstat_test.gz" \
--snp_column rsids \
--effect_allele_column alt \
--non_effect_allele_column ref \
--beta_column beta \
--pvalue_column pval \
--model_db_snp_key rsid \
--output_file /.../test.csv

- For varID
python3 /.../SPrediXcan.py \
--model_db_path /.../ARIC/ARIC_AA_hg38.db \
--covariance /.../ARIC/ARIC_AA_hg38.txt.gz \
--gwas_folder /.../sumstats_PrediXcan/ \
--gwas_file_pattern "sumstat_test.gz" \
--snp_column varID \
--effect_allele_column alt \
--non_effect_allele_column ref \
--beta_column beta \
--pvalue_column pval \
--model_db_snp_key varID \
--keep_non_rsid \
--output_file /.../test.csv


I got the same problem ("0% of model SNPs used") when using those weights: https://predictdb.org/post/2022/02/12/protein-prediction-for-trait-mapping-in-diverse-populations/
However here, the number of overlap SNPs is only 168...

Weights df:
   gene           rsid               varID ref_allele
1 SL012561_ENSG00000134193.15 chr1:119804807 1_120347430_G_A_b38          G
2 SL012561_ENSG00000134193.15 chr1:119875876 1_120418499_G_T_b38          G
  eff_allele     weight
1          A  0.5365755
2          T -0.4122139

Sumstats is the same as before but removing the "chr" prefix.

Command line:
python3 /.../SPrediXcan.py \
--model_db_path /.../EUR_PCAIR_baseline_models_rho0.1_zpval0.05.db \
--covariance /.../EUR_PCAIR_baseline_models_rho0.1_zpval0.05_covariances.txt.gz \
--gwas_folder /.../sumstats_PrediXcan/test/ \
--gwas_file_pattern ".*gz" \
--snp_column varID \
--effect_allele_column alt \
--non_effect_allele_column ref \
--beta_column beta \
--pvalue_column pval \
--model_db_snp_key varID \
--keep_non_rsid \
--output_file /...test1.csv

Thanks in advance,
Zulema

Hae Kyung Im

unread,
Mar 25, 2025, 2:16:56 PMMar 25
to Zurodher, Sofía Salazar, PrediXcan/MetaXcan
Sofia,
Could you help with this question?
Thanks
Haky

--
You received this message because you are subscribed to the Google Groups "PrediXcan/MetaXcan" group.
To unsubscribe from this group and stop receiving emails from it, send an email to predixcanmetax...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/predixcanmetaxcan/55abe360-7214-4463-9641-9306a5d81f1an%40googlegroups.com.

Sofía Salazar

unread,
Mar 25, 2025, 5:03:27 PMMar 25
to Zurodher, PrediXcan/MetaXcan
Hi,

To answer your question "Why did I not get the same (i.e., 38% of model's snps used) when using rsid?", this might be due to the fact that the covariance file SNP ids need to match as well. I think that the covariance file (ARIC_AA_hg38.txt.gz) might not have rsids, thus why it only works with the varID. You could try checking for overlaps with the covariance file as well.  Also, you can look at this reference to see different ways to match model IDs.

Sofia

Zulema Rodríguez Hernández

unread,
Mar 25, 2025, 6:08:01 PMMar 25
to Sofía Salazar, PrediXcan/MetaXcan
Many thanks Sofia and Hae for the quick reply,
Yes ARIC AA covariance file only includes varID. I though there might exist some kind of internal PrediXcan pipeline to match weights rsids and covariance varID since weights includes both rsids and varID

This brings me to my second question related to these weights: https://predictdb.org/post/2022/02/12/protein-prediction-for-trait-mapping-in-diverse-populations/  - I got the same problem ("0% of model SNPs used").
Here, the number of overlap SNPs is only 168...

I was matching by VarID. However the covariance file only includes rsid. So, I guess I should use rsid for matching. My question is where those rsids come from (I expected rsXXXX) and how I can create this kind of rsid column in my GWAS summary statistics.

Weights df:
   gene           rsid               varID ref_allele
1 SL012561_ENSG00000134193.15 chr1:119804807 1_120347430_G_A_b38          G
2 SL012561_ENSG00000134193.15 chr1:119875876 1_120418499_G_T_b38          G
  eff_allele     weight
1          A  0.5365755
2          T -0.4122139

zcat EUR_PCAIR_dapg_0.001_T_rho0.1_zpval0.05_covariances.txt.gz | head
GENE    RSID1   RSID2   VALUE
SL012561_ENSG00000134193.15     chr1:119804807  chr1:119804807  0.0552687671860519
SL012561_ENSG00000134193.15     chr1:119804807  chr1:119875876  -4.73361214087117e-05
SL012561_ENSG00000134193.15     chr1:119875876  chr1:119875876  0.208497164689527
SL003340_ENSG00000133048.13     chr1:203179232  chr1:203179232  0.032760034053521

Sofía Salazar

unread,
Mar 26, 2025, 9:47:23 AMMar 26
to Zulema Rodríguez Hernández, PrediXcan/MetaXcan
Hi,

Since the rsid columns in both the covariance file and the weights seem to be in the format chr#:position, you could create a new column in your summary statistics (let's suppose you name it SNP)  in the format chr#:position and use --snp_column SNP  and --model_db_snp_key rsid

Sofia

Zulema Rodríguez Hernández

unread,
Mar 26, 2025, 10:21:17 AMMar 26
to Sofía Salazar, PrediXcan/MetaXcan
Hi,

Yes, rsids are in the format chr:position. However, they do not align with the pos in the variID column. See below an example. 

   gene           rsid               varID ref_allele
1 SL012561_ENSG00000134193.15 chr1:119804807 1_120347430_G_A_b38          G
2 SL012561_ENSG00000134193.15 chr1:119875876 1_120418499_G_T_b38          G

Are they in different genome build?  When checking on dbSNP I found this:
1:119804807 (GRCh38)
1:120347430 (GRCh37)

Does the rsid refer to GRCh38 and varID to GRCh38? If so,  the sufix "_b38" in the varID column is confusing...



Sofía Salazar

unread,
Mar 26, 2025, 2:58:49 PMMar 26
to Zulema Rodríguez Hernández, PrediXcan/MetaXcan
Hi,

Thank you for pointing this out. It seems like the varID ids from the model were generated with the GRCH37 position and the suffix was added by default at the moment of creating the model. I will see to it so this mismatch is corrected on our end. But in the meantime and for your application, I think the easiest approach will be to use the rsid column from the model and covariate files as long as you generate a column in your summary statistics with the same format and also that your summary statistics are in grch38 build as well.

Let me know if you have any questions, and thanks again for your input!

Sofia

Zulema Rodríguez Hernández

unread,
Apr 2, 2025, 5:16:52 AMApr 2
to Sofía Salazar, PrediXcan/MetaXcan
Many thanks Sofia.

Can I use something like this to instruct SPrediXcan.py to create a SNP ID in my summary statistics in the format chr{}_{} to match with the weights rsID?"  Or Should I add this column in my sumstats and after that running SPrediXcan.py

python3 /....t/MetaXcan/software/SPrediXcan.py \
--model_db_path /.../ARIC/ARIC_AA_hg38.db \
--covariance /.../ARIC/ARIC_AA_hg38.txt.gz \
--gwas_folder /.../ \
--gwas_file_pattern "gz" \
--snp_column rsID\
--position_column pos \

--effect_allele_column alt \
--non_effect_allele_column ref \
--beta_column beta \
--pvalue_column pval \
--model_db_snp_key rsID\
--on_the_fly_mapping METADATA "chr{}_{}" \
--output_file /.../output2/test.csv

Many thanks in advance,
Zulema
Reply all
Reply to author
Forward
0 new messages