#chrom pos ref alt rsids nearest_genes pval mlogp beta sebeta af_alt af_alt _cases af_alt_controls varID
1 13668 G A rs2691328 DDX11L1 0.635823 0.196664 0.1992 0.4206 53 0.0059906 0.00634106 0.00598931 chr1_13668_G_A_b38
1 14506 G A rs1240557819 WASH7P 0.298196 0.525498 -0.469636 0
I created the "varID" column to match with the Protein ARIC AA weights, which looks like:
gene rsid varID ref_allele eff_allele
1 ENSG00000254521 rs3752135 chr19_51497370_T_G_b38 T G
2 ENSG00000254521 rs3826667 chr19_51500820_C_T_b38 C T
3 ENSG00000254521 rs3810114 chr19_51503097_T_C_b38 T C
46,56 % of the SNPs (33009 SNPs) in the GWAS sumstats match with the ARIC weights (NSNPs= 70881) (for both rsid and varID). However, when trying to match by rsid I got ""0% of model SNPs used" but by matching by varID: "38% of model's snps used". Why I did not get the same (i.e., 38% of model's snps used) when using rsid?
See below the command lines:
- For rsid:
python3 /.../SPrediXcan.py \
--model_db_path /.../ARIC/ARIC_AA_hg38.db \
--covariance /.../ARIC/ARIC_AA_hg38.txt.gz \
--gwas_folder /.../sumstats_PrediXcan/ \
--gwas_file_pattern "sumstat_test.gz" \
--snp_column rsids \
--effect_allele_column alt \
--non_effect_allele_column ref \
--beta_column beta \
--pvalue_column pval \
--model_db_snp_key rsid \
--output_file /.../test.csv
- For varID
python3 /.../SPrediXcan.py \
--model_db_path /.../ARIC/ARIC_AA_hg38.db \
--covariance /.../ARIC/ARIC_AA_hg38.txt.gz \
--gwas_folder /.../sumstats_PrediXcan/ \
--gwas_file_pattern "sumstat_test.gz" \
--snp_column varID \
--effect_allele_column alt \
--non_effect_allele_column ref \
--beta_column beta \
--pvalue_column pval \
--model_db_snp_key varID \
--keep_non_rsid \
--output_file /.../test.csv
However here, the number of overlap SNPs is only 168...
Weights df:
gene rsid varID ref_allele
1 SL012561_ENSG00000134193.15 chr1:119804807 1_120347430_G_A_b38 G
2 SL012561_ENSG00000134193.15 chr1:119875876 1_120418499_G_T_b38 G
eff_allele weight
1 A 0.5365755
2 T -0.4122139
Sumstats is the same as before but removing the "chr" prefix.
Command line:
python3 /.../SPrediXcan.py \
--model_db_path /.../EUR_PCAIR_baseline_models_rho0.1_zpval0.05.db \
--covariance /.../EUR_PCAIR_baseline_models_rho0.1_zpval0.05_covariances.txt.gz \
--gwas_folder /.../sumstats_PrediXcan/test/ \
--gwas_file_pattern ".*gz" \
--snp_column varID \
--effect_allele_column alt \
--non_effect_allele_column ref \
--beta_column beta \
--pvalue_column pval \
--model_db_snp_key varID \
--keep_non_rsid \
--output_file /...test1.csv
Thanks in advance,
Zulema