0% of model SNPs used Predict.py


Kalyani Kottilil

Feb 18, 2025, 12:01:33 PM
to PrediXcan/MetaXcan
Hi all. I've been struggling with the '0% of model SNPs used' error in multiple cohorts, across multiple permutations of the command, for individual-level PrediXcan.

I cloned the most recent version of the software (git clone https://github.com/hakyimlab/MetaXcan.git) and then created the conda environment from the .yaml file in the software directory. I've attempted this in UK Biobank (hg19) as well as in an external cohort (hg38, imputed); in both cases I tested with files in vcf.gz format.

I've attached images of the commands and outputs in UKB. The first shows a test with the elastic net model and vcf files (hg19) using rsids as the map key. I've also checked the overlap in rsids between the elastic net model weights table and the vcf file: it's 13%. The second image shows a test with the mashr model and vcf files, where I've included a chain file for liftover, set the database snp key to 'varID', and specified on-the-fly mapping.

Elastic net:
Screenshot 2025-02-18 at 11.44.22 AM.png

Mashr:
Screenshot 2025-02-18 at 11.58.10 AM.png
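
A minimal sketch of the rsid overlap check described above, in case it helps others reproduce it. It assumes the model .db file is an SQLite database with a `weights` table that has `rsid` and `varID` columns; the toy rows and the `vcf_ids` set are made up for illustration, and the overlap is measured here as a fraction of the model's IDs, which may not be the denominator used for the 13% figure.

```python
import sqlite3

def model_variant_ids(conn, key="rsid"):
    """Return the distinct variant IDs stored in the model's weights table."""
    # The column name is whitelisted because it is interpolated into the SQL text.
    if key not in ("rsid", "varID"):
        raise ValueError("key must be 'rsid' or 'varID'")
    rows = conn.execute(f"SELECT DISTINCT {key} FROM weights")
    return {row[0] for row in rows if row[0]}

def overlap_fraction(model_ids, genotype_ids):
    """Fraction of model variant IDs that also appear among the genotype IDs."""
    if not model_ids:
        return 0.0
    return len(model_ids & genotype_ids) / len(model_ids)

# Toy in-memory database standing in for a real model file (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weights (gene TEXT, rsid TEXT, varID TEXT, weight REAL)")
conn.executemany(
    "INSERT INTO weights VALUES (?, ?, ?, ?)",
    [
        ("geneA", "rs1", "chr22_100_A_G_b38", 0.10),
        ("geneA", "rs2", "chr22_200_C_T_b38", -0.20),
        ("geneB", "rs3", "chr1_300_G_A_b38", 0.30),
        ("geneB", "rs4", "chr1_400_T_C_b38", 0.05),
    ],
)

# IDs as they might be collected from `bcftools query -f '%ID\n' file.vcf.gz`.
vcf_ids = {"rs1", "rs9"}

frac = overlap_fraction(model_variant_ids(conn, "rsid"), vcf_ids)
print(f"{frac:.0%} of model rsids found among the genotype IDs")
```

For a real model file, `sqlite3.connect("path/to/model.db")` would replace the in-memory connection, and passing `key="varID"` would run the same check against the varID column.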

Hae Kyung Im

Feb 18, 2025, 12:06:07 PM
to Kalyani Kottilil, PrediXcan/MetaXcan, Sofía Salazar
Sofia,
could you help with this?
Thanks
Haky

--
You received this message because you are subscribed to the Google Groups "PrediXcan/MetaXcan" group.
To unsubscribe from this group and stop receiving emails from it, send an email to predixcanmetax...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/predixcanmetaxcan/152114b1-f7f5-411a-ae41-7379a43e4889n%40googlegroups.com.

Kalyani Kottilil

Feb 18, 2025, 12:18:01 PM
to PrediXcan/MetaXcan
For reference, I've also attempted to do this using the UKB bgen data and elastic net models. I end up with the following error:
Screenshot 2025-02-18 at 12.14.39 PM.png

Sofía Salazar

Feb 18, 2025, 1:28:21 PM
to Kalyani Kottilil, PrediXcan/MetaXcan
Hi, 
This issue is common when there are slight mismatches between variant id formats from the genotype and the model. However, your approach looks good to me. In order to help you find the mismatch, could you try:

1. Your first example should work as long as all of the variant IDs in your genotype file are rsids. You could check with:
bcftools query -f '%ID\n' c22_subsetted.vcf.gz | head -3

If that shows that you have rsids, I would still recommend trying the --on_the_fly_mapping method, as this generates a new variant ID for all of your variants regardless of their rsid.

2. For the second example, the approach looks correct, as not all of the MASHR models' variants have rsids. There could be an inconsistency in the mapping format, so please try it with --on_the_fly_mapping {}_{}_{}_{}_b38 instead.

3. As for your third example, it seems to me that there might be an issue with the genotype file. There was a discussion of this issue on this forum, if you want to check it out, though it doesn't seem to have been resolved. I would encourage you to try the first two options first.
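
To see why the choice of template in options 1 and 2 matters, here is a small sketch of how such a template turns VCF fields into a variant ID. It assumes the four `{}` placeholders are filled with chromosome, position, reference allele and alternate allele, in that order; `make_variant_id` is a hypothetical helper, not part of MetaXcan, and the coordinates are invented.

```python
def make_variant_id(template, chrom, pos, ref, alt):
    """Fill a four-slot template with chromosome, position, ref and alt."""
    return template.format(chrom, pos, ref, alt)

# VCF whose CHROM column is "22": the template has to add the "chr" prefix.
with_prefix = make_variant_id("chr{}_{}_{}_{}_b38", "22", 16050075, "A", "G")

# VCF whose CHROM column is already "chr22": the template must not add it again.
without_prefix = make_variant_id("{}_{}_{}_{}_b38", "chr22", 16050075, "A", "G")

# Mismatch: "chr" template applied to a "chr22" CHROM value doubles the prefix,
# so the resulting ID cannot match any varID of the form chr22_..._b38.
doubled = make_variant_id("chr{}_{}_{}_{}_b38", "chr22", 16050075, "A", "G")

print(with_prefix)     # chr22_16050075_A_G_b38
print(without_prefix)  # chr22_16050075_A_G_b38
print(doubled)         # chrchr22_16050075_A_G_b38
```

Comparing one generated ID by eye against one varID from the model's weights table is a quick way to spot a prefix, separator, allele-order or build-suffix mismatch.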

In any case, please refer to this documentation, if you haven't done so, on the different ways to run this script; in your case I think examples 1, 2 and 4 apply. This example run might also be useful for comparing file formats and variant naming conventions.

Please let me know if the issues persist,

Sofia


Kalyani Kottilil

Feb 18, 2025, 4:05:24 PM
to PrediXcan/MetaXcan
Hi Sofia,

Thanks for the tips. I tried both 1 and 2 and still got 0% of models' SNPs used. Do you have any advice for next steps? Happy to provide any additional information if need be.

Sofía Salazar

Feb 18, 2025, 4:27:19 PM
to PrediXcan/MetaXcan
Hi,

Could you please provide the output of running this command?
bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\n' c22_subsetted.vcf.gz | head -5

Sofia

Kalyani Kottilil

Feb 19, 2025, 3:43:36 PM
to PrediXcan/MetaXcan
Hi,

Here is the output from the bcftools command. 

Screenshot 2025-02-19 at 3.41.21 PM.png

Kalyani Kottilil

Feb 25, 2025, 11:44:43 AM
to PrediXcan/MetaXcan
Hi Sofia,

Wanted to circle back to this and was wondering if you had any additional advice for me. Thank you!

Sofía Salazar

Feb 25, 2025, 2:06:14 PM
to Kalyani Kottilil, PrediXcan/MetaXcan
Hi Kalyani,

I apologize for the late response. Looking at your file, the second example that you sent (the one with the MASHR model) should run correctly. Since some of your variants don't have rsids, you should use --model_db_snp_key varID and --on_the_fly_mapping METADATA "chr{}_{}_{}_{}_b38" with both the elastic net models and the MASHR models.

Does your c22_subsetted.vcf.gz file contain only chromosome 22 variants? If so, and if many of your SNPs are truly absent from the model files, your vcf could cover less than 1% of the models' SNPs, since the models include SNPs from the whole genome; that would produce this message. Could you please share whether your output file is empty?
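
As a rough illustration of that ceiling, the sketch below counts what share of a model's variants sit on a single chromosome, which bounds the percentage of model SNPs that a one-chromosome vcf can ever cover. It assumes varIDs of the form chr22_16050075_A_G_b38; the helper names and the toy list are hypothetical.

```python
from collections import Counter

def chrom_of(var_id):
    """Extract the chromosome label from a varID such as chr22_16050075_A_G_b38."""
    return var_id.split("_", 1)[0]

def share_on_chromosome(model_var_ids, chrom):
    """Fraction of the model's variants that lie on the given chromosome."""
    counts = Counter(chrom_of(v) for v in model_var_ids)
    total = sum(counts.values())
    return counts[chrom] / total if total else 0.0

# Toy stand-in for the varID column of a whole-genome model (invented IDs).
toy_model = [
    "chr1_100_A_G_b38",
    "chr1_250_C_T_b38",
    "chr2_300_G_A_b38",
    "chr7_400_T_C_b38",
    "chr22_500_A_C_b38",
]

share = share_on_chromosome(toy_model, "chr22")
print(f"At most {share:.0%} of this toy model could be covered by a chr22-only vcf")
```

Running the same count on the real weights table gives the upper bound to compare against the percentage reported in the log.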

Sofia
