0% SNPs used

254 views
Skip to first unread message

Molly Martorella

unread,
Aug 6, 2020, 7:44:52 PM8/6/20
to PrediXcan/MetaXcan
Hi,

0% of model SNPs are being used when I run the PrediXcan Predict.py script. I'm trying to predict the expression of UKBB white british individuals using the GTEx V8 mashr models.

This is my code (I activate imlabtools and export variables before this) :

i=${SLURM_ARRAY_TASK_ID} #i is chromosomes 1-22
p=$1 #p is UKWB or other population prefix
y=$2 #y is tissue

python $METAXCAN/Predict.py \
    --model_db_path $MODEL/mashr_${y}.db \
    --model_db_snp_key varID \
    --liftover $LIFTOVER/hg19ToHg38.over.chain.gz \
    --bgen_genotypes $GENO/${p}.ImpQC.Chr${i}.bgen \    
    --on_the_fly_mapping METADATA "chr{}_{}_{}_{}_b37" \
    --prediction_output $OUTPUT/${y}/${p}_${y}_chr${i}.predict.txt \
    --prediction_summary_output $OUTPUT/${y}/${p}_${y}_chr${i}.summary.txt \
    --verbosity 9 \
    --throw

I'm not sure if the issue is with on_the_fly_mapping. I thought maybe since 0% were used there is some issue with the liftover, and it had to do with this mapping. The BGEN genotype file just has chromosome number and not chr before it, so it shouldn't be outputting chrchr etc. I tried changing this around to b38 etc. and including/exlcuding the chr in case I misinterpretted what it's doing but none of that worked.

I also tried just using bgen_use_rsid and omitting the model_db_snp_key since the db files do have rsids, but that also did not work (didn't have high hopes for that though).

Thanks for your help,
Molly

Alvaro Barbeira

unread,
Aug 18, 2020, 12:04:22 PM8/18/20
to Molly Martorella, PrediXcan/MetaXcan
Hi Molly,

I recommend you use "chr{}_{}_{}_{}_b38" for the --on_the_fly_mapping argument. 

This parameter specifies a format string to be used when building a variant id from its (liftover coordinates) and alleles. 
Then these ids will be matched to the ids in the mashr model which are all hg38-based.

Best,

Alvaro

--
You received this message because you are subscribed to the Google Groups "PrediXcan/MetaXcan" group.
To unsubscribe from this group and stop receiving emails from it, send an email to predixcanmetax...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/predixcanmetaxcan/8d4b0bba-90fe-4603-9db0-aec2009998c2n%40googlegroups.com.

haky

unread,
Dec 1, 2020, 11:23:06 PM12/1/20
to PrediXcan/MetaXcan
Hi Molly

were you able to address this issue? For UK Biobank scale prediction, you may need to use a more customized approach. Here is what Milton Pividori and Yanyu Liang used https://github.com/hakyimlab/predixcan_prediction

I hope this helps
Haky

Jeremy Elman

unread,
Apr 26, 2021, 4:19:47 PM4/26/21
to PrediXcan/MetaXcan
Hello,

I am also running into a problem when trying to predict expression for UKB data using GTEx v8 mashr models. My call to Predict.py is essentially the same as Molly's above but using the argument "--on_the_fly_mapping chr{}_{}_{}_{}_b38". 

I get the error message "BGEN code doessn't support variant mapping through a method". Looking at  the function bgen_file_geno_lines() in BGENGenotype.py, it appears that only dict mapping is supported at this time. Is there a recommended file I could specify with --variant_mapping or any plans to support on the fly mapping for bgen data in the future? 

Maybe the customized approach noted above is the way to go? It's only a small subset of subjects, so the standard implementation should be fine for me if I can get the mapping to work.

Thank you in advance for your help!
Jeremy

Alvaro Barbeira

unread,
Apr 26, 2021, 5:03:47 PM4/26/21
to Jeremy Elman, PrediXcan/MetaXcan
Hi Jeremy,

Indeed, there are currently no plans to support on-the-fly-mapping for BGEN.
You would have to create a tabular file mapping each of your input variants to our models variants

The argument you would need to use is:
--variant_mapping /path/to/tab-separated-file.txt KEYNAME VALUENAME
where:
a) path refers to a simple tabular file
b) KEYNAME is the name of the column with your input variants
c) VALUENAME is the name of the column with the mapped variant as existing in a model (NA would cause the model to be ignored)

The script mentioned in a previous version is a simplified version of predixcan on BGENs that might be easier to customize - but you'll basically be redoing your own way the above mapping. I think you'll be better off just computing the mapping you need instead of developing a custom on-the-fly mapping on the simplified predixcan.

Best,

Alvaro

Jeremy Elman

unread,
Apr 26, 2021, 5:14:49 PM4/26/21
to PrediXcan/MetaXcan
Hi Alvaro,

Thank you for the quick answer! That makes sense and I agree that using the standard version with a custom mapping file is probably the most straightforward approach here. 

Thanks,
Jeremy

Hyunchan Ahn

unread,
Jun 16, 2023, 12:04:14 PM6/16/23
to PrediXcan/MetaXcan

Hi everyone,

This was really helpful. With the perspective of running MASHR models on UKB .bgens, we sought to test the manual mapping via --variant_mapping against --bgen_use_rsid in the EN models first.

We subsetted the c1 .bgen based on rsids using bgenix first (now containing the first 10 variants with unique rsIDs that were considered by the EN model). 

Using --bgen_use_rsid worked flawlessly.

Next, we generated a tab separated file like so (alternate_ids obtained from bgenix -list as ID):

 

ID        rsid

1:10000_A_G rs123

1:20000_C_G  rs234

rs345   rs345

 

Finally, we used the command “--variant_mapping file.txt ID rsid”.

We obtain predictions considering only such variants which are stored as rsid (in this example e. g. rs345), but not for such which are encoded by chromosome:position_ref_alt.

We tested multiple things without success, including:

  • using 01 or 1 for the chromosome.
  • changing the separators in the .txt file and using --force_colon

Did we potentially get the variant identifiers wrong (i. e. are they not what’s listed by the bgenix -list output under “alternate_ids”)?

Any help would be highly appreciated.

 

BW,

Hyunchan


2021년 4월 26일 월요일 오후 10시 14분 49초 UTC+1에 Jeremy Elman님이 작성:
Reply all
Reply to author
Forward
0 new messages