Hello,
I have calulated h2 with ldsc.py:
- once based on LD scores I calculated from 100Genomes data and a sniplist that includes all SNP with MAF > .05 (details below)
My problem is that the analysis based on eur_w_ld_chr results in a h2 of 0.1379 (based on 1149619 remianing SNPs) but the analysis based on the 1000genomes data results in a h2 of 0.0426 (4402390) remaining SNPs.
This can't be right. Has anybody a suggestion as to where I might have gone wrong?
Guido
Here is my workflow
#### get plink files and calculate LD scores ####
# $i is a the chromosom variable
fn=ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
plink --vcf $fn --cm-map 1000GP_Phase3/genetic_map_chr@_combined_b37.txt --make-bed --out plink/$i
ldsc.py --bfile plink/$i --l2 --ld-wind-cm 1 --out LD/$i
#### make summary stats file ####
# $pheno is the penotype of interest
munge_sumstats.py --sumstats $pheno.1000G.EUR.txt --N $N --out $pheno.1000G.EUR --merge-alleles sniplist.1000G.EUR.QC.txt
#### calculate h2 ####
# $LDroot is the path to the directory with all LD files
ldsc.py --h2 $pheno.1000G.EUR.sumstats.gz --ref-ld-chr $LDroot --w-ld-chr $LDroot --out h2/$pheno.1000G.EUR.h2
####### log files ########
###########
## munge ##
###########
*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.0
* (C) 2014-2015 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call:
./munge_sumstats.py \
--out ADHD.1000G.EUR \
--merge-alleles sniplist.1000G.EUR.QC.txt \
--N 17666.0 \
--sumstats ADHD.1000G.EUR.txt
Interpreting column names as follows:
a1: Allele 1, interpreted as ref allele for signed sumstat.
pval: p-Value
Zscore: Z-score (0 --> no effect; above 0 --> A1 is trait/risk increasing)
snpid: Variant ID (e.g., rs number)
a2: Allele 2, interpreted as non-ref allele for signed sumstat.
Reading list of SNPs for allele merge from sniplist.1000G.EUR.QC.txt
Read 5961159 SNPs for allele merge.
Reading sumstats from ADHD.1000G.EUR.txt into memory 5000000 SNPs at a time.
Read 5193804 SNPs from --sumstats file.
Removed 0 SNPs not in --merge-alleles.
Removed 0 SNPs with missing values.
Removed 0 SNPs with INFO <= 0.9.
Removed 0 SNPs with MAF <= 0.01.
Removed 0 SNPs with out-of-bounds p-values.
Removed 791207 variants that were not SNPs or were strand-ambiguous.
4402597 SNPs remain.
Removed 0 SNPs with duplicated rs numbers (4402597 SNPs remain).
Using N = 17666.0
Median value of Zscore was -0.001, which seems sensible.
Removed 207 SNPs whose alleles did not match --merge-alleles (4402390 SNPs remain).
Writing summary statistics for 5961159 SNPs (4402390 with nonmissing beta) to ADHD.1000G.EUR.sumstats.gz.
Metadata:
Mean chi^2 = 1.034
Lambda GC = 1.037
Max chi^2 = 28.0
0 Genome-wide significant SNPs (some may have been removed by filtering).
Conversion finished at Wed Jun 7 10:09:45 2017
Total time elapsed: 1.0m:35.26s
################################
## h2 with 1000genomes LD scores ##
################################
*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.0
* (C) 2014-2015 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call:
./ldsc.py \
--h2 ADHD.1000G.EUR.sumstats.gz \
--ref-ld-chr LD/ \
--out h2/ADHD.1000G.EUR.h2 \
--w-ld-chr LD/
Beginning analysis at Wed Jun 7 22:49:50 2017
Reading summary statistics from ADHD.1000G.EUR.sumstats.gz ...
Read summary statistics for 4402390 SNPs.
Reading reference panel LD Score from LD/[1-22] ...
Read reference panel LD Scores for 81011404 SNPs.
Removing partitioned LD Scores with zero variance.
Reading regression weight LD Score from LD/[1-22] ...
Read regression weight LD Scores for 81011404 SNPs.
After merging with reference panel LD, 4402390 SNPs remain.
After merging with regression SNP LD, 4402390 SNPs remain.
Using two-step estimator with cutoff at 30.
Total Observed scale h2: 0.0297 (0.0196)
Lambda GC: 1.0375
Mean Chi^2: 1.0342
Intercept: 1.0202 (0.0068)
Ratio: 0.5894 (0.1975)
Analysis finished at Wed Jun 7 23:00:24 2017
Total time elapsed: 10.0m:33.92s
###############################
## h2 with eur_w_ld_chr LD scores ##
###############################
*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.0
* (C) 2014-2015 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call:
./ldsc.py \
--h2 ADHD.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--out h2/ADHD.h2 \
--w-ld-chr eur_w_ld_chr/
Beginning analysis at Tue May 30 17:06:26 2017
Reading summary statistics from ADHD.sumstats.gz ...
Read summary statistics for 1086020 SNPs.
Reading reference panel LD Score from eur_w_ld_chr/[1-22] ...
Read reference panel LD Scores for 1290028 SNPs.
Removing partitioned LD Scores with zero variance.
Reading regression weight LD Score from eur_w_ld_chr/[1-22] ...
Read regression weight LD Scores for 1290028 SNPs.
After merging with reference panel LD, 1085328 SNPs remain.
After merging with regression SNP LD, 1085328 SNPs remain.
Using two-step estimator with cutoff at 30.
Total Observed scale h2: 0.0732 (0.0306)
Lambda GC: 1.0345
Mean Chi^2: 1.0338
Intercept: 1.0073 (0.0075)
Ratio: 0.2151 (0.2218)
Analysis finished at Tue May 30 17:06:40 2017
Total time elapsed: 14.08s
PS: For these analyses I used a sniplist based on the these frequencies:
https://data.broadinstitute.org/alkesgroup/LDSCORE/1000G_Phase3_frq.tgz. But correponding LD score files have L2 column, and I couldn't figure out which was the correct LD score to use, so I calculated them myself from 1000Genomes data. This is not great because afaik I didn't slect only europeans in the 1000 genomes data when calculating LD scores on my own, but I don't think this is the problem here, becasue the results are the same when I use the MAF from the plink files I used to make my own LD scores to make a sniplist.