Hi, first of all, thank you so much for this wonderful tool! It’s helping me a lot with my analysis!
This week, I wanted to obtain a plot similar to one generated with MetaXcan using Elastic Net models (from the article: Multi-ethnic genome-wide association study for atrial fibrillation):
I used Elastic Net models as well (more on that below) and then wanted to run SPrediXcan with 5 GWAS files using the MASHR-M models that you recommend on your GitHub to predict associations between gene expression and a disease. To do this, I harmonized my GWAS files following this tutorial: https://github.com/hakyimlab/MetaXcan/wiki/Tutorial:-GTEx-v8-MASH-models-integration-with-a-Coronary-Artery-Disease-GWAS.
I didn’t apply a liftover because the reference genome Hg38 was used for my GWAS analysis.
I also didn’t perform imputation because the GWAS files already included imputed SNPs.
So, I ran SPrediXcan with my 5 harmonized GWAS files and 13 MASHR-M models.
As stated in the title of this post, I got:
I have been told that these values are very strange. So, I was wondering if perhaps there was a mistake in the way I ran the commands or in the inputs that I used.
I’ll be very grateful if someone can point out any mistakes I might have made.
I’ll detail the commands, parameters, and files used for the harmonization and association analysis with SPrediXcan. Please feel free to point out any errors.
For the harmonization, I followed this tutorial: https://github.com/hakyimlab/MetaXcan/wiki/Tutorial:-GTEx-v8-MASH-models-integration-with-a-Coronary-Artery-Disease-GWAS.
As mentioned before, I didn’t apply a liftover because the reference genome Hg38 was used for my GWAS analysis.
I also didn’t perform imputation because the GWAS files already contained imputed SNPs.
I used the 1000 Genomes-based genotype panel linked here: https://zenodo.org/records/3518299. File: reference_panel_1000G/variant_metadata.txt.gz). As done in the tutorial.
I’m very new to this kind of analysis. Perhaps I should have used data/gtex_v8_eur_filtered_maf0.01_monoallelic_variants.txt.gz (a SNP annotation file, listing all GTEx v8 variants with MAF>0.01 in Europeans) instead of reference_panel_1000G/variant_metadata.txt.gz as done in this tutorial: https://github.com/hakyimlab/summary-gwas-imputation/wiki/Reference-Data-Set-Compilation#variant-selectionvariant-metadata?
The sample sizes of the GWAS files ranged from 366, 510, 828, 1,335 to 10,087.
The harmonization process for the 5 GWAS files used the following parameters:
```bash
# Paths to input and output directories
GWAS_TOOLS="/path/to/summary-gwas-imputation/src"
# Paths to input and output directories
GWAS_TOOLS="/path/to/summary-gwas-imputation/src" # Counter of GWAS files
g=0
# Removed the quotes to allow glob expansion of the path to GWAS files
for gwas in ${GWAS_DIR}/*.txt.gz;
do
((g++))
gwas_id=$(echo $gwas | sed -re 's/.*somes_(.*)_with.*/\1/')
echo "$ IDº${g}: $gwas_id"
echo "Starting GWAS harmonization of $gwas_id"
# Running gwas_parsing.py to do harmonization
python3 "${GWAS_TOOLS}/gwas_parsing.py" \
-gwas_file ${gwas} \
-snp_reference_metadata "${DATA}/reference_panel_1000G/variant_metadata.txt.gz" METADATA \
-output_column_map rsID variant_id \
-output_column_map Allele2 non_effect_allele \
-output_column_map Allele1 effect_allele \
-output_column_map BETA effect_size \
-output_column_map PVAL pvalue \
-output_column_map N sample_size \
-output_column_map CHR chromosome \
-output_column_map BP position \
-output_column_map AF frequency \
-output_column_map SCORE zscore \
-output_column_map SE standard_error \
-output_order variant_id panel_variant_id chromosome \
position effect_allele non_effect_allele frequency pvalue zscore effect_size standard_error sample_size \
-output "${OUTPUT}/harmonized_gwas/${gwas_id}_harmonized.txt.gz"
echo -e "Finished GWAS harmonization of ${gwas_id}\n"
done
```
I obtained the following execution log for 4 of the 5 gwas files:
```
$ IDº1
Starting GWAS harmonization of gwas #1
INFO - Parsing input GWAS
INFO - loaded 13760387 variants
INFO - Creating index to attach reference ids
INFO - Acquiring reference metadata
INFO - alligning alleles
INFO - 12092497 variants after restricting to reference variants
INFO - Ensuring variant uniqueness
INFO - 11384727 variants after ensuring uniqueness
INFO - Checking for missing frequency entries
INFO - Saving...
INFO - Finished converting GWAS in 1132.5789962540002 seconds
Finished GWAS harmonization of gwas #1
$ IDº2
Starting GWAS harmonization of gwas #2
INFO - Parsing input GWAS
INFO - loaded 9601733 variants
INFO - Creating index to attach reference ids
INFO - Acquiring reference metadata
INFO - alligning alleles
INFO - 8296822 variants after restricting to reference variants
INFO - Ensuring variant uniqueness
INFO - 7766008 variants after ensuring uniqueness
INFO - Checking for missing frequency entries
INFO - Saving...
INFO - Finished converting GWAS in 511.5262082509944 seconds
Finished GWAS harmonization of gwas #2
$ IDº3
Starting GWAS harmonization of gwas #3
INFO - Parsing input GWAS
INFO - loaded 8077645 variants
INFO - Creating index to attach reference ids
INFO - Acquiring reference metadata
INFO - alligning alleles
INFO - 7396625 variants after restricting to reference variants
INFO - Ensuring variant uniqueness
INFO - 6966141 variants after ensuring uniqueness
INFO - Checking for missing frequency entries
INFO - Saving...
INFO - Finished converting GWAS in 470.3015594979952 seconds
Finished GWAS harmonization of gwas #3
$ IDº4
Starting GWAS harmonization of gwas #4
INFO - Parsing input GWAS
INFO - loaded 10362089 variants
INFO - Creating index to attach reference ids
INFO - Acquiring reference metadata
INFO - alligning alleles
INFO - 8965759 variants after restricting to reference variants
INFO - Ensuring variant uniqueness
INFO - 8403427 variants after ensuring uniqueness
INFO - Checking for missing frequency entries
INFO - Saving...
INFO - Finished converting GWAS in 553.1050925870004 seconds
Finished GWAS harmonization of gwas #4
```
I didn't got any error message.
Then, I ran SPrediXcan using the newly harmonized GWAS with 13 MASHR models.
The MASHR-M models were downloaded from here: https://zenodo.org/records/3518299 (file: mashr_eqtl.tar).
I used the following parameters:
```bash
GWAS_DIR="/path/to/harmonized_gwas"
MODELS_DIR="/path/to/data/models/eqtl/mashr"
OUTPUT_DIR="/path/to/results/mashr_integration/spredixcan"
LOG_DIR="/path/to/logs/mashr_models"
# Path to SPredixcan executable
SPrediXcan="/path/to/MetaXcan/software/SPrediXcan.py"
# Counter of GWAS files
g=0
# Removed the quotes to allow glob expansion of the path to GWAS files
for gwas in ${GWAS_DIR}/*.txt.gz;
do
((g++))
# echo $gwas
gwas_id=$(echo $gwas | sed -re 's/.*\/(.*)_harmonized\.txt\.gz/\1/')
echo "$ IDº${g}: $gwas_id"
# Counter of tissue models
t=0
for model in ${MODELS_DIR}/*.db;
do
# Increase counter of tissue models
((t++))
# Extract the name of the tissue from each model file
tissue=$(basename $model | sed -re 's/mashr_(.*)\.db/\1/')
echo "** Nº${t}, Processing GWAS: $gwas_id, Model tissue: $tissue **"
# Building paths to covariance matrix and output files:
covariance_path="${MODELS_DIR}/mashr_${tissue}.txt.gz"
output_path="${OUTPUT_DIR}/gwas_${gwas_id}_tissue_${tissue}.csv"
echo "Model: $model"
echo "Covariance: $covariance_path"
echo "Output path: $output_path"
# Running SPrediXcan.py
python3 "$SPrediXcan" \
--model_db_path ${model} \
--covariance ${covariance_path} \
--gwas_file ${gwas} \
--gwas_file_pattern "*.txt.gz" \
--snp_column panel_variant_id \
--effect_allele_column effect_allele \
--non_effect_allele_column non_effect_allele \
--zscore_column zscore \
--beta_column effect_size \
--pvalue_column pvalue \
--position_column position \
--chromosome_column chromosome \
--freq_column frequency \
--keep_non_rsid \
--additional_output \
--model_db_snp_key varID \
--throw \
--output_file ${output_path}
done
echo -e "All ${t} tissues processed for the ${gwas_id} cohort.\n"
done
```
More than 70% of the SNPs in the model were used in all 65 executions (5 GWAS x 13 MASHR tissue models). The execution log can be found attached to this message.
Before using the MASHR models, I used the Elastic Net models. After applying the Bonferroni correction to the raw P-values, few genes were significant (less than 20 had a P-value less than 0.05) across the 245 CSV files (5 GWAS x 49 models).
I wanted to use the MASHR-M models to see what results I could obtain with these models.
In stark contrast to the outputs from SPrediXcan using non-harmonized GWAS and Elastic Net models now, as detailed above, I obtain very different results: a large number of significant genes, some with extremely small P-values or P-values equal to zero, even after Bonferroni correction, and some with very large or small effect sizes.
I'll be very grateful if someone can shed some light on my process and what could possibly have been my mistake.
Thank you so much in advance for your kind help.