0% of model SNPs

Zurodher

unread,

Mar 25, 2025, 9:06:26 AM3/25/25

to PrediXcan/MetaXcan

Hi all,

I've been struggling with the "0% of model SNPs used".

I am using a GWAS summstats (build 38) which looks like:

#chrom pos ref alt rsids nearest_genes pval mlogp beta sebeta af_alt af_alt _cases af_alt_controls varID
1 13668 G A rs2691328 DDX11L1 0.635823 0.196664 0.1992 0.4206 53 0.0059906 0.00634106 0.00598931 chr1_13668_G_A_b38
1 14506 G A rs1240557819 WASH7P 0.298196 0.525498 -0.469636 0

I created the "varID" column to match with the Protein ARIC AA weights, which looks like:

gene rsid varID ref_allele eff_allele
1 ENSG00000254521 rs3752135 chr19_51497370_T_G_b38 T G
2 ENSG00000254521 rs3826667 chr19_51500820_C_T_b38 C T
3 ENSG00000254521 rs3810114 chr19_51503097_T_C_b38 T C

46,56 % of the SNPs (33009 SNPs) in the GWAS sumstats match with the ARIC weights (NSNPs= 70881) (for both rsid and varID). However, when trying to match by rsid I got ""0% of model SNPs used" but by matching by varID: "38% of model's snps used". Why I did not get the same (i.e., 38% of model's snps used) when using rsid?

See below the command lines:

- For rsid:

python3 /.../SPrediXcan.py \
--model_db_path /.../ARIC/ARIC_AA_hg38.db \
--covariance /.../ARIC/ARIC_AA_hg38.txt.gz \
--gwas_folder /.../sumstats_PrediXcan/ \
--gwas_file_pattern "sumstat_test.gz" \
--snp_column rsids \
--effect_allele_column alt \
--non_effect_allele_column ref \
--beta_column beta \
--pvalue_column pval \
--model_db_snp_key rsid \
--output_file /.../test.csv

- For varID

python3 /.../SPrediXcan.py \
--model_db_path /.../ARIC/ARIC_AA_hg38.db \
--covariance /.../ARIC/ARIC_AA_hg38.txt.gz \
--gwas_folder /.../sumstats_PrediXcan/ \
--gwas_file_pattern "sumstat_test.gz" \
--snp_column varID \
--effect_allele_column alt \
--non_effect_allele_column ref \
--beta_column beta \
--pvalue_column pval \
--model_db_snp_key varID \
--keep_non_rsid \
--output_file /.../test.csv

I got the same problem ("0% of model SNPs used") when using those weights: https://predictdb.org/post/2022/02/12/protein-prediction-for-trait-mapping-in-diverse-populations/

However here, the number of overlap SNPs is only 168...

Weights df:

gene rsid varID ref_allele
1 SL012561_ENSG00000134193.15 chr1:119804807 1_120347430_G_A_b38 G
2 SL012561_ENSG00000134193.15 chr1:119875876 1_120418499_G_T_b38 G
eff_allele weight
1 A 0.5365755
2 T -0.4122139

Sumstats is the same as before but removing the "chr" prefix.

Command line:

python3 /.../SPrediXcan.py \
--model_db_path /.../EUR_PCAIR_baseline_models_rho0.1_zpval0.05.db \
--covariance /.../EUR_PCAIR_baseline_models_rho0.1_zpval0.05_covariances.txt.gz \
--gwas_folder /.../sumstats_PrediXcan/test/ \
--gwas_file_pattern ".*gz" \
--snp_column varID \
--effect_allele_column alt \
--non_effect_allele_column ref \
--beta_column beta \
--pvalue_column pval \
--model_db_snp_key varID \
--keep_non_rsid \
--output_file /...test1.csv

Thanks in advance,

Zulema

Hae Kyung Im

unread,

Mar 25, 2025, 2:16:56 PM3/25/25

to Zurodher, Sofía Salazar, PrediXcan/MetaXcan

Sofia,

Could you help with this question?

Thanks

Haky

--
You received this message because you are subscribed to the Google Groups "PrediXcan/MetaXcan" group.
To unsubscribe from this group and stop receiving emails from it, send an email to predixcanmetax...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/predixcanmetaxcan/55abe360-7214-4463-9641-9306a5d81f1an%40googlegroups.com.

Sofía Salazar

unread,

Mar 25, 2025, 5:03:27 PM3/25/25

to Zurodher, PrediXcan/MetaXcan

Hi,

To answer your question "Why did I not get the same (i.e., 38% of model's snps used) when using rsid?", this might be due to the fact that the covariance file SNP ids need to match as well. I think that the covariance file (ARIC_AA_hg38.txt.gz) might not have rsids, thus why it only works with the varID. You could try checking for overlaps with the covariance file as well. Also, you can look at this reference to see different ways to match model IDs.

Sofia

Zulema Rodríguez Hernández

unread,

Mar 25, 2025, 6:08:01 PM3/25/25

to Sofía Salazar, PrediXcan/MetaXcan

Many thanks Sofia and Hae for the quick reply,

Yes ARIC AA covariance file only includes varID. I though there might exist some kind of internal PrediXcan pipeline to match weights rsids and covariance varID since weights includes both rsids and varID

This brings me to my second question related to these weights: https://predictdb.org/post/2022/02/12/protein-prediction-for-trait-mapping-in-diverse-populations/ - I got the same problem ("0% of model SNPs used").

Here, the number of overlap SNPs is only 168...

I was matching by VarID. However the covariance file only includes rsid. So, I guess I should use rsid for matching. My question is where those rsids come from (I expected rsXXXX) and how I can create this kind of rsid column in my GWAS summary statistics.

Weights df:

gene rsid varID ref_allele
1 SL012561_ENSG00000134193.15 chr1:119804807 1_120347430_G_A_b38 G
2 SL012561_ENSG00000134193.15 chr1:119875876 1_120418499_G_T_b38 G
eff_allele weight
1 A 0.5365755
2 T -0.4122139

zcat EUR_PCAIR_dapg_0.001_T_rho0.1_zpval0.05_covariances.txt.gz | head
GENE RSID1 RSID2 VALUE
SL012561_ENSG00000134193.15 chr1:119804807 chr1:119804807 0.0552687671860519
SL012561_ENSG00000134193.15 chr1:119804807 chr1:119875876 -4.73361214087117e-05
SL012561_ENSG00000134193.15 chr1:119875876 chr1:119875876 0.208497164689527
SL003340_ENSG00000133048.13 chr1:203179232 chr1:203179232 0.032760034053521

Sofía Salazar

unread,

Mar 26, 2025, 9:47:23 AM3/26/25

to Zulema Rodríguez Hernández, PrediXcan/MetaXcan

Hi,

Since the rsid columns in both the covariance file and the weights seem to be in the format chr#:position, you could create a new column in your summary statistics (let's suppose you name it SNP) in the format chr#:position and use --snp_column SNP and --model_db_snp_key rsid

Sofia

Zulema Rodríguez Hernández

unread,

Mar 26, 2025, 10:21:17 AM3/26/25

to Sofía Salazar, PrediXcan/MetaXcan

Hi,

Yes, rsids are in the format chr:position. However, they do not align with the pos in the variID column. See below an example.

gene rsid varID ref_allele
1 SL012561_ENSG00000134193.15 chr1:119804807 1_120347430_G_A_b38 G
2 SL012561_ENSG00000134193.15 chr1:119875876 1_120418499_G_T_b38 G

Are they in different genome build? When checking on dbSNP I found this:

1:119804807 (GRCh38)
1:120347430 (GRCh37)

Does the rsid refer to GRCh38 and varID to GRCh38? If so, the sufix "_b38" in the varID column is confusing...

Sofía Salazar

unread,

Mar 26, 2025, 2:58:49 PM3/26/25

to Zulema Rodríguez Hernández, PrediXcan/MetaXcan

Hi,

Thank you for pointing this out. It seems like the varID ids from the model were generated with the GRCH37 position and the suffix was added by default at the moment of creating the model. I will see to it so this mismatch is corrected on our end. But in the meantime and for your application, I think the easiest approach will be to use the rsid column from the model and covariate files as long as you generate a column in your summary statistics with the same format and also that your summary statistics are in grch38 build as well.

Let me know if you have any questions, and thanks again for your input!

Sofia

Zulema Rodríguez Hernández

unread,

Apr 2, 2025, 5:16:52 AM4/2/25

to Sofía Salazar, PrediXcan/MetaXcan

Many thanks Sofia.

Can I use something like this to instruct SPrediXcan.py to create a SNP ID in my summary statistics in the format chr{}_{} to match with the weights rsID?" Or Should I add this column in my sumstats and after that running SPrediXcan.py

python3 /....t/MetaXcan/software/SPrediXcan.py \
--model_db_path /.../ARIC/ARIC_AA_hg38.db \
--covariance /.../ARIC/ARIC_AA_hg38.txt.gz \
--gwas_folder /.../ \
--gwas_file_pattern "gz" \
--snp_column rsID\
--position_column pos \

--effect_allele_column alt \
--non_effect_allele_column ref \
--beta_column beta \
--pvalue_column pval \

--model_db_snp_key rsID\
--on_the_fly_mapping METADATA "chr{}_{}" \
--output_file /.../output2/test.csv