Replicating MDD heritability Estimation

372 views
Skip to first unread message

Shing Wan Choi

unread,
Nov 17, 2015, 12:25:39 AM11/17/15
to ldsc_users
Hi Brendan,

We are currently trying to run heritability estimation on the PGC MDD data set as stated in the "LD Score regression distinguishes confounding from polygenicity in genome-wide association studies" paper. 

In the supplementary, it was stated that the heritability estimation of MDD after adjusting for liability threshold is around 0.409. However, when we perform LDSC on the PGC MDD daat, we only get an estimate of 0.1783. 

Our procedure were as follow:

#Prepare LD Score
1. Download 1000g vcf file 
2. plink --vcf <vcf files>  --biallelic-only --make-bed --out <output>
3. Extract all EUR samples 
3. plink --bfile <EUR output> --make-bed --out <EUR output>.filter --maf 0.002    #Remove singletons (1/503)
4, python ldsc.py --bfile  <EUR output>.filter --l2 --ld-wind-kb 1000 --out  EUR      #We would like to use 1mb window instead of cM

#Prepare test statistic (lift over to hg19)
awk 'NR!=1 {print "chr"$2" "$3-1" "$3" "$1}' pgc.mdd.full.2012-04.txt > mdd.hg18.bed
liftOver mdd.hg18.bed hg18ToHg19.over.chain.gz mdd.hg19.bed mdd.hg19.failed
awk 'NR==FNR {hash[$4]=$3;next}{printf $1"\t"$2"\t"hash[$1]; for(i=4; i <= NF; ++i){printf "\t"$i}; print""}' mdd.hg19.bed pgc.mdd.full.2012-04.txt > pgc.mdd.hg19.txt

 #munge 
python munge_sumstats.py --sumstats pgc.mdd.hg19.txt --out mdd.ldsc  --snp snpid --p pval --N-cas 9240 --N-con 9519 --signed-sumstats or,1

#Perform LD Score Regression
python ldsc.py --h2 mdd.ldsc.sumstats.gz --ref-ld EUR --out mdd.ldsc.res --samp-prev 0.4925636 --pop-prev 0.15 --print-coefficients --w-ld EUR



#Result:
Beginning analysis at Tue Nov 17 10:36:51 2015
Reading summary statistics from mdd.ldsc.sumstats.gz ...
Read summary statistics for 905020 SNPs.
Reading reference panel LD Score from EUR ...
Read reference panel LD Scores for 23735408 SNPs.
Reading regression weight LD Score from EUR ...
Read regression weight LD Scores for 23735408 SNPs.
After merging with reference panel LD, 895562 SNPs remain.
After merging with regression SNP LD, 895562 SNPs remain.
Using two-step estimator with cutoff at 30.
Total Liability scale h2: 0.1783 (0.0327)
Lambda GC: 1.0741
Mean Chi^2: 1.0793
Intercept: 1.016 (0.0085)
Ratio: 0.2023 (0.1068)
Analysis finished at Tue Nov 17 10:40:17 2015
Total time elapsed: 3.0m:25.83s




Is this the correct use of LDSC? If so why did the result differ so much?

Thank you for your help!

Sam

Raymond Walters

unread,
Nov 17, 2015, 12:21:06 PM11/17/15
to Shing Wan Choi, ldsc_users
Hi Sam,
Could you please post your log file from computing the LD Scores? Your LD Score regression results show 23 million SNPs in the reference panel, which is unusually high.

As another starting point you can try downloading the pre-computed european LD scores here and see if those give you a more consistent h2 estimate.

Cheers,
Raymond


-----
Raymond K. Walters
Postdoctoral Research Fellow
Analytic & Translational Genetics Unit
Massachusetts General Hospital



--
You received this message because you are subscribed to the Google Groups "ldsc_users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ldsc_users+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ldsc_users/841175f9-5bc5-4d1c-8d52-0b359f686635%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Shing Wan Choi

unread,
Nov 17, 2015, 12:31:37 PM11/17/15
to Raymond Walters, ldsc_users
Hi Raymond,

The log file are as follow

Options:
--out EUR
--bfile All
--l2
--ld-wind-kb 1000.0

Beginning analysis at Mon Nov 16 16:52:16 2015
Read list of 80846107 SNPs from All.bim
Read list of 503 individuals from All.fam
Reading genotypes from All.bed
Estimating LD Score.
Writing LD Scores for 23735408 SNPs to EUR.l2.ldscore.gz

Summary of LD Scores in EUR.l2.ldscore.gz
        MAF        L2
mean  0.073    81.327
std   0.128   150.286
min   0.001   -44.753
25%   0.001     2.581
50%   0.004    27.905
75%   0.082   110.233
max   0.500  3773.838

MAF/LD Score Correlation Matrix
       MAF     L2
MAF  1.000  0.362
L2   0.362  1.000
Analysis finished at Tue Nov 17 03:21:18 2015
Total time elapsed: 10.0h:29.0m:2.31s


Basically, I only filtered out the singletons. Should I only include snps on the gwas chip when computing the LD score?
You received this message because you are subscribed to a topic in the Google Groups "ldsc_users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/ldsc_users/8iwBiuVO2Ow/unsubscribe.
To unsubscribe from this group and all its topics, send an email to ldsc_users+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ldsc_users/216966B3-48FF-4730-B974-D79D89D264D9%40broadinstitute.org.

Raymond Walters

unread,
Nov 17, 2015, 12:49:53 PM11/17/15
to Shing Wan Choi, ldsc_users
Hi Sam,
From the original LD score paper, the recommendation was to filter to MAF > 1% for estimating the LD scores since standard errors for those scores in 1000 Genomes are quite large. That should help address the extreme distribution of scores observed here. 

I’d also recommend checking for relatedness among your 503 european samples (the LD score paper had N=378 europeans from 1000 Genomes after filtering cousins).

Cheers,
Raymond

Shing Wan Choi

unread,
Nov 18, 2015, 10:07:15 AM11/18/15
to Raymond Walters, ldsc_users
Hi  Raymond,

Thank you, I have just tried to filter the reference by relatedness where samples with pi hat larger than 0.125 were filtered out. That leaves me with 446 samples for the reference genome. I then recomputed the LDSC and get  0.1748 (0.0333) instead. This was very close to the previous estimation. 

I read the paper again and it seems like the authors also perform many additional filtering such as:

1. Only SNPs in HapMap3
2. SNPs in regions with long-range LD
3. Excluded Pericentromeric regions

But then, will the exclusion of these region leads to such a drastic increase in the heritability estimation? What should be considered as the "standard" procedure for using LDSC??

Thank you for your time!

Sam

Raymond Walters

unread,
Nov 18, 2015, 10:42:08 AM11/18/15
to Shing Wan Choi, ldsc_users
Hi Sam,
Did you also filter on MAF > 1%? I suspect that will have a much larger impact.

If the MAF filter doesn’t solve it then those 3 additional points from the paper would indeed be the next thing to check. I think (I’d have to double check) that the original paper only applied that filtering to the SNPs used for regression (in the --w-ld argument) rather than the computation of LD scores, but it should not be problematic to apply it to both. 

Cheers,
Raymond

Shing Wan Choi

unread,
Nov 18, 2015, 11:50:47 AM11/18/15
to Raymond Walters, ldsc_users
Yes, I just tried with the filtering of maf > 1% and the result doesn't change much. In fact the estimation was even smaller:   0.1605 (0.0317)
 
Nonetheless, it seems rather strange that the mere change of SNP use can change the estimation from 0.409 in the paper to the number that I have found here...

Raymond Walters

unread,
Nov 20, 2015, 6:58:55 PM11/20/15
to Shing Wan Choi, ldsc_users
Hi Sam,
I agree the magnitude of the difference is strange. 

Testing this further, I’ve confirmed that I get similar estimates to yours (h2_liab ~.19) when using our pre-computed LD scores. I’m now in the process of going back to the published analysis so hopefully we can get a more direct comparison of what may be different. I’ll keep you updated once I know more.

Cheers,
Raymond

Shing Wan Choi

unread,
Nov 20, 2015, 10:12:59 PM11/20/15
to Raymond Walters, ldsc_users
Hi Raymond,

Thank you very much. LDSC is definitely very useful, just that sometimes it might be slightly confusing as to what to use ...

Sam

Brendan

unread,
Nov 21, 2015, 8:22:53 AM11/21/15
to ldsc_users, rwal...@broadinstitute.org
Hi Sam,

The h2 estimates from Supp Table 1 were obtained using an older pre-release version of ldsc. The difference is that the older version used M = (# SNPs in the reference panel used to estimate LD Score) by default, whereas the current version defaults to M = (# of SNPs with MAF > 5% in the reference panel used to estimate LD Score). Strictly speaking, the quantity estimated by the older version of ldsc is the heritability explained by all SNPs in 1kG Europeans, but estimating this quantity requires extrapolating that h2 per SNP from common variants measured in GWAS is the same as h2 per SNP among rare variants (e.g., doubletons in 1kG). This assumption takes us far away from the support of the data, which is why decided not to make this quantity the default estimand in the version of ldsc that we eventually released. The current version of ldsc estimates the heritability explained by all MAF>5% SNPs by default, which is a much more reasonable quantity to estimate from common variant GWAS data. However, we didn't figure this out until after the first LD Score paper was published, which is why the caption to Supp Table 1 says "If the average rare SNP in 1000 Genomes explains less phenotypic variance than the average common SNP, then a smaller value of M would be more appropriate, and the estimates in this table will be biased upwards. Relaxing these assumptions in order to obtain a robust estimate of h2(1kG) is a direction for further research"

Again, this issue has since been resolved and is not a problem in the current version of ldsc. 

Brendan

Shing Wan Choi

unread,
Nov 21, 2015, 8:49:27 AM11/21/15
to Brendan, ldsc_users, Raymond Walters
Hi Brendan,

Thank you for clarifying the situation. So basically what Raymond and I obtained are the reasonable estimated based on the updated LDSC.

That's great, thank you!

Sam

--
You received this message because you are subscribed to a topic in the Google Groups "ldsc_users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/ldsc_users/8iwBiuVO2Ow/unsubscribe.
To unsubscribe from this group and all its topics, send an email to ldsc_users+...@googlegroups.com.

Brendan

unread,
Nov 21, 2015, 3:25:58 PM11/21/15
to ldsc_users, bbuliks...@gmail.com, rwal...@broadinstitute.org
Hi Sam,

Yes, I think the numbers that you and Raymond are getting are sensible.

Brendan

rbrto

unread,
Nov 21, 2015, 9:53:00 PM11/21/15
to Brendan, ldsc_users, rwal...@broadinstitute.org
hi,
would it make sense to try to estimate the slope of h2 as a function of the snp freq threshold? (1%, 5%, ...)
--
You received this message because you are subscribed to the Google Groups "ldsc_users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ldsc_users+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ldsc_users/70255c2b-b714-4e1d-bed4-1dd8f4901794%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages