h2 smaller when using 1000Genomes as reference, compared to eur_w_ld_chr

338 views
Skip to first unread message

Guido Biele

unread,
Jun 8, 2017, 3:55:59 AM6/8/17
to ldsc_users
Hello,

I have calulated h2 with ldsc.py:
- once using the eur_w_ld_chr LD scores and the snplist provide here: https://data.broadinstitute.org/alkesgroup/LDSCORE/
- once based on LD scores I calculated from 100Genomes data and a sniplist that includes all SNP with MAF > .05  (details below)

My problem is that the analysis based on eur_w_ld_chr results in a h2 of  0.1379 (based on 1149619 remianing SNPs) but the analysis based on the 1000genomes data results in a h2 of 0.0426 (4402390) remaining SNPs.

This can't be right. Has anybody a suggestion as to where I might have gone wrong?

Guido

Here is my workflow 

#### get plink files and calculate LD scores ####
# $i is a the chromosom variable
fn
=ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
plink
--vcf $fn --cm-map 1000GP_Phase3/genetic_map_chr@_combined_b37.txt --make-bed --out plink/$i
ldsc
.py --bfile plink/$i --l2 --ld-wind-cm 1 --out LD/$i  

#### make summary stats file ####
# $pheno is the penotype of interest
munge_sumstats
.py --sumstats $pheno.1000G.EUR.txt --N $N --out $pheno.1000G.EUR --merge-alleles sniplist.1000G.EUR.QC.txt

#### calculate h2 ####
# $LDroot is the path to the directory with all LD files
ldsc
.py --h2 $pheno.1000G.EUR.sumstats.gz --ref-ld-chr $LDroot --w-ld-chr $LDroot --out h2/$pheno.1000G.EUR.h2




####### log files ########

###########
## munge ##
###########
*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.0
* (C) 2014-2015 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call: 
./munge_sumstats.py \
--out ADHD.1000G.EUR \
--merge-alleles sniplist.1000G.EUR.QC.txt \
--N 17666.0 \
--sumstats ADHD.1000G.EUR.txt 

Interpreting column names as follows:
a1: Allele 1, interpreted as ref allele for signed sumstat.
pval: p-Value
Zscore: Z-score (0 --> no effect; above 0 --> A1 is trait/risk increasing)
snpid: Variant ID (e.g., rs number)
a2: Allele 2, interpreted as non-ref allele for signed sumstat.

Reading list of SNPs for allele merge from sniplist.1000G.EUR.QC.txt
Read 5961159 SNPs for allele merge.
Reading sumstats from ADHD.1000G.EUR.txt into memory 5000000 SNPs at a time.
Read 5193804 SNPs from --sumstats file.
Removed 0 SNPs not in --merge-alleles.
Removed 0 SNPs with missing values.
Removed 0 SNPs with INFO <= 0.9.
Removed 0 SNPs with MAF <= 0.01.
Removed 0 SNPs with out-of-bounds p-values.
Removed 791207 variants that were not SNPs or were strand-ambiguous.
4402597 SNPs remain.
Removed 0 SNPs with duplicated rs numbers (4402597 SNPs remain).
Using N = 17666.0
Median value of Zscore was -0.001, which seems sensible.
Removed 207 SNPs whose alleles did not match --merge-alleles (4402390 SNPs remain).
Writing summary statistics for 5961159 SNPs (4402390 with nonmissing beta) to ADHD.1000G.EUR.sumstats.gz.

Metadata:
Mean chi^2 = 1.034
Lambda GC = 1.037
Max chi^2 = 28.0
0 Genome-wide significant SNPs (some may have been removed by filtering).

Conversion finished at Wed Jun  7 10:09:45 2017
Total time elapsed: 1.0m:35.26s

################################
## h2 with 1000genomes LD scores ##
################################
*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.0
* (C) 2014-2015 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call: 
./ldsc.py \
--h2 ADHD.1000G.EUR.sumstats.gz \
--ref-ld-chr LD/ \
--out h2/ADHD.1000G.EUR.h2 \
--w-ld-chr LD/ 

Beginning analysis at Wed Jun  7 22:49:50 2017
Reading summary statistics from ADHD.1000G.EUR.sumstats.gz ...
Read summary statistics for 4402390 SNPs.
Reading reference panel LD Score from LD/[1-22] ...
Read reference panel LD Scores for 81011404 SNPs.
Removing partitioned LD Scores with zero variance.
Reading regression weight LD Score from LD/[1-22] ...
Read regression weight LD Scores for 81011404 SNPs.
After merging with reference panel LD, 4402390 SNPs remain.
After merging with regression SNP LD, 4402390 SNPs remain.
Using two-step estimator with cutoff at 30.
Total Observed scale h2: 0.0297 (0.0196)
Lambda GC: 1.0375
Mean Chi^2: 1.0342
Intercept: 1.0202 (0.0068)
Ratio: 0.5894 (0.1975)
Analysis finished at Wed Jun  7 23:00:24 2017
Total time elapsed: 10.0m:33.92s

###############################
## h2 with eur_w_ld_chr LD scores ##
###############################
*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.0
* (C) 2014-2015 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call: 
./ldsc.py \
--h2 ADHD.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--out h2/ADHD.h2 \
--w-ld-chr eur_w_ld_chr/ 

Beginning analysis at Tue May 30 17:06:26 2017
Reading summary statistics from ADHD.sumstats.gz ...
Read summary statistics for 1086020 SNPs.
Reading reference panel LD Score from eur_w_ld_chr/[1-22] ...
Read reference panel LD Scores for 1290028 SNPs.
Removing partitioned LD Scores with zero variance.
Reading regression weight LD Score from eur_w_ld_chr/[1-22] ...
Read regression weight LD Scores for 1290028 SNPs.
After merging with reference panel LD, 1085328 SNPs remain.
After merging with regression SNP LD, 1085328 SNPs remain.
Using two-step estimator with cutoff at 30.
Total Observed scale h2: 0.0732 (0.0306)
Lambda GC: 1.0345
Mean Chi^2: 1.0338
Intercept: 1.0073 (0.0075)
Ratio: 0.2151 (0.2218)
Analysis finished at Tue May 30 17:06:40 2017
Total time elapsed: 14.08s




PS: For these analyses I used a sniplist based on the these frequencies: https://data.broadinstitute.org/alkesgroup/LDSCORE/1000G_Phase3_frq.tgz. But correponding LD score files have L2 column, and I couldn't figure out which was the correct LD score to use, so I calculated them myself from 1000Genomes data. This is not great because afaik I didn't slect only europeans in the 1000 genomes data when calculating LD scores on my own, but I don't think this is the problem here, becasue the results are the same when I use the MAF from the plink files I used to make my own LD scores to make a sniplist.


Raymond Walters

unread,
Jun 8, 2017, 5:01:54 AM6/8/17
to Guido Biele, ldsc_users
Hi Guido,
Two likely factors here to possibly explain this difference:

1) Have your GWAS results been strictly filtered for imputation quality and per-variant sample size? Including variants with lower imputation quality and/or lots of missingness will tend to deflate h2 estimates. One of the primary motivations for using the HapMap3 SNP list is that it’s relatively safe to assume those variants are well imputed and present in most/all cohorts even if sample size and imputation info aren’t available for the study (as appears to be the case from your munge_sumstats log).

2) Computing LD scores from all 1000 Genomes samples rather than just Europeans will have a substantial impact on the scores. In addition to MAF differences, LD patterns also differ significantly between continental populations, so your LD scores from merged 1KG populations will be a poor proxy for European LD. This will add both error and directional biases (given mean LD differs by population) that will impact the h2 estimate.

Hope that helps.

Cheers,
Raymond

--
You received this message because you are subscribed to the Google Groups "ldsc_users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ldsc_users+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ldsc_users/56590fbd-8945-42ab-99a5-7dfc2ca24c5d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Guido Biele

unread,
Jun 8, 2017, 8:42:06 AM6/8/17
to ldsc_users, guido...@gmail.com
Thanks,
this is usefull!
(my background is not in genetics, as you have undoubtedly realized by now),

I am primarily using public available gwas summary data (based on links from LDhub). I think i added sample size.

I'll recalculate the LD scores based in the european sample from the 1000Genomes correct and try it again.

My assumption was that more SNPs should mean higher h2, but this might not be the case of the added SNPs have a worse imputation quality (though it appears to me that most GWAS i looked at used many more SNPs that in the HapMap3 SNP list.

guido
Reply all
Reply to author
Forward
0 new messages