LD score regression


kevin....@gmail.com

unread,
Apr 28, 2015, 5:28:36 PM4/28/15
to ldsc_...@googlegroups.com

Dear all,

I am new to the LD score regression method, so apologies for my naive questions. We are meta-analyzing exome-chip data with a large sample size. We noticed a huge inflation of the lambda value (λ = 2) for the common variants (MAF > 0.05), but the lambda value is acceptable (λ = 1.04) for the low-frequency and rare variants (MAF < 0.05). We want to check whether the large inflation of lambda at the common variants is due to polygenicity combined with the increased sample size.

I want to use the phase 3 1000 Genomes EUR data. Originally, I wanted to extract these common variants from the 1000 Genomes dataset, calculate LD scores based on the 1000G .bed/.bim/.fam files, and then regress the chi-square value of each common variant against its LD score to obtain the LD score regression intercept. However, I then read something like "the variants included in LD score regression are a subset of the variants included in LD score estimation", and now I am a little puzzled about how to proceed.

Another question is how to make figure 2 or supplementary figures 8a-w in that paper. How do I get the regression weights? I imagine I can bin the LD score values into bins of 0-50, 50-100, 100-150, 150-200, 200-250, and 250+, and compute the mean chi-square value for the variants in each bin. But how do I get the regression weights?

 Thanks for your help and time.

 Kevin

Brendan

unread,
Apr 28, 2015, 6:43:59 PM4/28/15
to ldsc_...@googlegroups.com
Hi Kevin,

A few questions to help me get oriented -- are your common variant data also from exome chip? Is it exome chip + imputation or just exome chip? Are these European-only data? Did you run the analysis with PC covariates? 

re: getting everything to run -- I suggest you start with the genetic correlation tutorial. The steps for computing the LD Score regression intercept are basically the same as the steps for computing genetic correlation, except you will have one trait instead of two, and you can use the --h2 flag instead of --rg. For example, if you just want to compute h2 for scz as in the tutorial, you would run

python ldsc.py \
    --h2 scz.sumstats.gz \
    --ref-ld-chr eur_w_ld_chr/ \
    --w-ld-chr eur_w_ld_chr/ \
    --out scz
re: the figures -- currently there is no way to produce these figures automatically in ldsc, but it isn't hard to do. Just merge the LD Scores and the .sumstats files, bin the SNPs into 50 LD Score quantiles (e.g., using the cut2 function in the Hmisc R package, or qcut in python pandas), and for each quantile bin, plot a dot where the x-coordinate is the mean LD Score for SNPs in that quantile bin and the y-coordinate is the mean chi^2 for SNPs in that quantile bin. 

Let me know how the results look; happy to help you troubleshoot.
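
In pandas, that binning might look like the sketch below. Simulated data stands in for the merged LD Score / .sumstats file here, and the column names (SNP, L2, CHISQ) are assumptions; adjust them to match your own files.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_snps = 100_000

# Simulated stand-in for the result of merging LD Scores with .sumstats
merged = pd.DataFrame({
    "SNP": [f"rs{i}" for i in range(n_snps)],
    "L2": rng.gamma(shape=2.0, scale=50.0, size=n_snps),  # LD Scores
})
# Simulated chi^2 with E[chi^2] = 1 + 0.01 * LD Score, as under polygenicity
merged["CHISQ"] = rng.chisquare(df=1, size=n_snps) * (1.0 + 0.01 * merged["L2"])

# Cut into 50 LD Score quantile bins and take per-bin means
merged["bin"] = pd.qcut(merged["L2"], 50, labels=False, duplicates="drop")
binned = merged.groupby("bin").agg(mean_l2=("L2", "mean"),
                                   mean_chisq=("CHISQ", "mean"))

# Each row of `binned` is one dot: x = mean_l2, y = mean_chisq
print(binned.head())
```

The same thing works in R with Hmisc::cut2; the only important points are that the bins are quantiles of LD Score and that you plot per-bin means.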

kevin....@gmail.com

unread,
Apr 29, 2015, 6:34:51 PM4/29/15
to ldsc_...@googlegroups.com

Hi Brendan,

Thanks a lot for your help and suggestion. I did the following steps:

Options:

--out BMI_sum

--merge-alleles w_hm3.snplist

--sumstats GIANTMETA_BMI_ADD_ALL_common.txt

 

Interpreting column names as follows:

N:     Sample size

a1:    Allele 1, interpreted as ref allele for signed sumstat.

beta:  [linear/logistic] regression coefficient (0 --> no effect; above 0 --> A1 is trait/risk increasing)

a2:    Allele 2, interpreted as non-ref allele for signed sumstat.

SNP:   Variant ID (e.g., rs number)

Pvalue:       p-Value

 

Reading list of SNPs for allele merge from w_hm3.snplist

Read 1217311 SNPs for allele merge.

Reading sumstats from GIANTMETA_BMI_ADD_ALL_common.txt into memory 5000000.0 SNPs at a time.

Read 27559 SNPs from --sumstats file.

Removed 11218 SNPs not in --merge-alleles.

Removed 213 SNPs with missing values.

Removed 0 SNPs with INFO <= 0.9.

Removed 0 SNPs with MAF <= 0.01.

Removed 0 SNPs with out-of-bounds p-values.

Removed 0 variants that were not SNPs or were strand-ambiguous.

16128 SNPs remain.

Removed 0 SNPs with duplicated rs numbers (16128 SNPs remain).

Removed 538 SNPs with N < 311121.333333 (15590 SNPs remain).

Median value of beta was 7.35e-05, which seems sensible.

Removed 0 SNPs whose alleles did not match --merge-alleles (15590 SNPs remain).

Writing summary statistics for 1217311 SNPs (15590 with nonmissing beta) to BMI_sum.sumstats.gz.

 

Metadata:

Mean chi^2 = 3.461

Lambda GC = 2.011

Max chi^2 = 1096.911

171 Genome-wide significant SNPs (some may have been removed by filtering).

 

Conversion finished at Wed Apr 29 18:12:54 2015

Total time elapsed: 9.46s

Options:

--h2 BMI.sumstats.gz

--ref-ld-chr eur_w_ld_chr/

--out BMI

--w-ld-chr eur_w_ld_chr/

 

Beginning analysis at Wed Apr 29 14:13:20 2015

Reading summary statistics from BMI.sumstats.gz ...

Read summary statistics for 15590 SNPs.

Reading reference panel LD Score from eur_w_ld_chr/[1-22] ...

Read reference panel LD Scores for 1293150 SNPs.

Removing partitioned LD Scores with zero variance.

Reading regression weight LD Score from eur_w_ld_chr/[1-22] ...

Read regression weight LD Scores for 1293150 SNPs.

After merging with reference panel LD, 15589 SNPs remain.

After merging with regression SNP LD, 15589 SNPs remain.

WARNING: number of SNPs less than 200k; this is almost always bad.

Using two-step estimator with cutoff at 30.

Total Observed scale h2: 0.2499 (0.0611)

Lambda GC: 2.0091

Mean Chi^2: 3.4606

Intercept: 1.1704 (0.0649)

Ratio: 0.0693 (0.0264)

Analysis finished at Wed Apr 29 14:13:34 2015

Total time elapsed: 13.73s

Here is the plot (attached). I still have not obtained the regression weights. As you can see, many SNPs were excluded from the analysis (11218 of 27559 SNPs were removed). What is the difference between the intercept from the --h2 step [Intercept: 1.1704 (0.0649)] and the intercept from the figure (2.1688)? I plotted the means of L2 and chi-square in each of the 50 bins in Excel and fit a linear trend line, which gives an intercept of 2.1688.

Perhaps I should use the phase 3 1000G data to re-calculate the LD scores, so as to include more variants, instead of using the files in the eur_w_ld_chr directory?

 

Thanks for your suggestions.

 

Kevin


plots.docx

kevin....@gmail.com

unread,
Apr 30, 2015, 7:34:59 AM4/30/15
to ldsc_...@googlegroups.com
Hi Brendan,

I am sorry for not having answered your questions. All the variants are from the exome chip (developed by Illumina primarily to assay rare variants, though many common variants were also placed on the chip). All the studies used the same chip and genotyped all the variants.

Thanks for your suggestion and help.

Kevin



Brendan

unread,
Apr 30, 2015, 9:12:10 AM4/30/15
to ldsc_...@googlegroups.com
Hi Kevin,

Ok, I think I understand the problem formulation now.

First, here are a few thoughts: 
  • Common coding SNPs are hugely enriched for h2 across a lot of traits (e.g., see Gusev et al AJHG 2014; Finucane et al bioRxiv 2015), so it isn't surprising that the mean chi^2 / lambda among SNPs in enriched regions is very high at huge N. 
  • I think that the chi^2 vs MAF pattern that you see in your data is unlikely to have resulted from confounding. If you have problems with pop strat, it ought to affect rare and common SNPs to roughly the same extent. On the other hand, if the inflation in mean chi^2 mostly comes from polygenicity, and the genetic architecture of BMI is such that most h2 comes from common SNPs rather than rare SNPs, then this could explain the result (I suspect that this is the case based on some preliminary results on MAF-partitioned LD Score regression, but don't quote me on that). In fact, I view the low lambda at rare SNPs as evidence against massive confounding due to pop strat. 
Second, here are 1kG LD Scores (actually 1kG LD Scores for SNPs with INFO > 0.9 in the SCZ GWAS, which is the best I have lying around at the moment). If those don't work, let me know; I can compute full 1kG LD Scores for you very easily.



re: the plot -- I wouldn't bother plotting the regression weights like we did in the paper; it doesn't add much. The intercept that you obtain from fitting a line through the quantile bins will differ from the intercept that ldsc gives you, because ldsc (1) uses a weighted regression and (2) filters out gwsig SNPs when computing the intercept (because linear regression is not robust to outliers in the response variable). In your case, it looks like there's a big-effect locus with LD Scores ~20 that is pulling the intercept up.

Btw, you should ignore the h2 estimate that ldsc gives you here -- if the regression SNPs are a small, non-random subset of the reference panel SNPs, then the estimate will be wrong. 

Here's another idea for a simple sanity check you might want to try: take those 25k SNPs, look up their N and non-GC-corrected chi^2 in the latest GWAS meta-analysis (which I believe is only minimally affected by pop strat), and compare mean[ (chi^2-1) / N ] for those 25k SNPs from the GWAS meta-analysis to mean[ (chi^2-1) / N ] from the exome chip meta-analysis. This is the same thing (up to multiplication by a constant) as comparing h2 computed with the --no-intercept flag in ldsc from the non-GC-corrected summary statistics of the GWAS meta-analysis (restricted to those 25k SNPs) and from the exome chip summary statistics. The absolute h2 numbers will not be correct for the reason described in the previous paragraph, but the difference between the two is meaningful (in particular, if the h2 you get from the exome chip data is >> what you get from the GWAS data, then you might have a problem). 
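
As a concrete sketch of this check, something like the following would work. The DataFrames below are hypothetical toy data, not real results; in practice you would load the two sets of summary stats and merge on SNP.

```python
import pandas as pd

def mean_excess_chisq_per_n(df):
    """mean[(chi^2 - 1) / N]; proportional to h2/M if there is no confounding."""
    return ((df["CHISQ"] - 1.0) / df["N"]).mean()

# Hypothetical toy summary stats; CHISQ = (beta/se)^2, non GC corrected
gwas = pd.DataFrame({"SNP": ["rs1", "rs2", "rs3"],
                     "CHISQ": [1.2, 2.5, 0.9],
                     "N": [300_000, 300_000, 290_000]})
exome = pd.DataFrame({"SNP": ["rs1", "rs2", "rs3"],
                      "CHISQ": [1.3, 2.4, 1.1],
                      "N": [450_000, 450_000, 440_000]})

# Restrict both datasets to the shared SNP set before comparing
shared = set(gwas["SNP"]) & set(exome["SNP"])
g = mean_excess_chisq_per_n(gwas[gwas["SNP"].isin(shared)])
e = mean_excess_chisq_per_n(exome[exome["SNP"].isin(shared)])
print(g, e)  # if the exome value is >> the gwas value, there may be a problem
```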

kevin....@gmail.com

unread,
May 12, 2015, 5:34:50 PM5/12/15
to ldsc_...@googlegroups.com

I tried the new full 1kG LD scores. Unexpectedly, the intercept increased (Intercept: 1.3197 (0.0712)) compared with the previous result (Intercept: 1.1704 (0.0649), based on SNPs with INFO > 0.9 in the SCZ GWAS).

 

More SNPs were included in the analysis with the full 1kG LD scores (18689 vs 15589 SNPs).

 

This might be due to ‘normal’ fluctuation in the estimate?

 

Thank you,

 

Kevin

 

Options:

--h2 BMI_0430.sumstats.gz

--ref-ld-chr var_exp_info09/

--out BMI_0430_LD

--w-ld-chr var_exp_info09/

 

Beginning analysis at Thu Apr 30 13:51:25 2015

Reading summary statistics from BMI_0430.sumstats.gz ...

Read summary statistics for 23373 SNPs.

Reading reference panel LD Score from var_exp_info09/[1-22] ...

Read reference panel LD Scores for 5886653 SNPs.

Removing partitioned LD Scores with zero variance.

Reading regression weight LD Score from var_exp_info09/[1-22] ...

Read regression weight LD Scores for 5886653 SNPs.

After merging with reference panel LD, 18689 SNPs remain.

After merging with regression SNP LD, 18689 SNPs remain.

WARNING: number of SNPs less than 200k; this is almost always bad.

Using two-step estimator with cutoff at 30.

Total Observed scale h2: 0.2251 (0.0516)

Lambda GC: 2.0641

Mean Chi^2: 3.4134

Intercept: 1.3197 (0.0712)

Ratio: 0.1325 (0.0295)

Analysis finished at Thu Apr 30 13:52:28 2015

Total time elapsed: 1.0m:3.23s

 

Options:

--h2 BMI_0430.sumstats.gz

--ref-ld-chr eur_w_ld_chr/

--out BMI_0512

--w-ld-chr eur_w_ld_chr/

 

Beginning analysis at Tue May 12 16:20:39 2015

Reading summary statistics from BMI_0430.sumstats.gz ...

Read summary statistics for 23373 SNPs.

Reading reference panel LD Score from eur_w_ld_chr/[1-22] ...

Read reference panel LD Scores for 1293150 SNPs.

Removing partitioned LD Scores with zero variance.

Reading regression weight LD Score from eur_w_ld_chr/[1-22] ...

Read regression weight LD Scores for 1293150 SNPs.

After merging with reference panel LD, 15589 SNPs remain.

After merging with regression SNP LD, 15589 SNPs remain.

WARNING: number of SNPs less than 200k; this is almost always bad.

Using two-step estimator with cutoff at 30.

Total Observed scale h2: 0.2499 (0.0611)

Lambda GC: 2.0091

Mean Chi^2: 3.4606

Intercept: 1.1704 (0.0649)

Ratio: 0.0693 (0.0264)

Analysis finished at Tue May 12 16:20:53 2015

Total time elapsed: 13.33s

Brendan

unread,
May 13, 2015, 9:04:23 AM5/13/15
to ldsc_...@googlegroups.com, kevin....@gmail.com
Yup, I agree with your interpretation. The distinction between 1.17 (0.07) and 1.32 (0.07) isn't really statistically significant [1]. The fact that the 1kG LD Score intercept is (slightly) higher than the HM3 intercept is consistent with what we've seen in datasets with essentially zero pop strat (e.g., in homogeneous, carefully QC'ed datasets with mixed model test statistics), which is why we've been recommending the HM3 LD Scores for most purposes.

Either way, the ratio statistics are well within the comfortable range (i.e., the intercept is <<< mean chi^2), so seems OK?

Brendan

[1] Actually it's even less significant than the SE's suggest -- the SE's quantify variation that would come from drawing new samples from the same data generating process and repeating the model-fitting procedure using the same LD Scores and the same regression SNPs. If you use a different set of regression SNPs and a different set of LD Scores, then you have introduced another source of variation (model uncertainty) not captured by the SE's. 

kevin....@gmail.com

unread,
May 13, 2015, 1:17:45 PM5/13/15
to ldsc_...@googlegroups.com, kevin....@gmail.com

Thank you, Brendan.

We are going to do the last round of analysis. I will use the HM3 LD scores (in the eur_w_ld_chr directory) for this final analysis. I will keep you informed.

Kevin

kevin....@gmail.com

unread,
Jun 25, 2015, 5:32:54 PM6/25/15
to ldsc_...@googlegroups.com

Hi Brendan,


I compared the mean [(chi^2-1)/N] for 22,595 SNPs between our exome chip array results and the latest GIANT Nature BMI results (PMID: 25673413). There is no big difference. Could you explain more about the rationale for this comparison? I obtained the chi^2 as the squared (beta/se).

Exome chip mean [(chi^2-1)/N] = 6.14E-06

Basic Statistical Measures

    Location                Variability
    Mean     6.14E-06       Std Deviation        5.46E-05
    Median  -9.07E-08       Variance             2.98E-09
    Mode     .              Range                0.00264
                            Interquartile Range  6.69E-06

 

GWAS mean [(chi^2-1)/N] = 2.00E-06

Basic Statistical Measures

    Location                Variability
    Mean     2.00E-06       Std Deviation        6.34E-05
    Median  -1.97E-06       Variance             4.02E-09
    Mode     0              Range                0.00317
                            Interquartile Range  6.93E-06

 


The intercept is 1.25 (0.06) for 15,842 SNPs using the HapMap3 (eur_w_ld_chr) reference files, and 1.49 (0.09) for 21,892 SNPs using the 1000G (var_exp_info09) files. The exome chip common SNPs are heavily enriched for known GWAS hits (they were non-randomly selected). Could this explain the increased intercept? Or could the small number of SNPs included in the analysis (far below the recommended 200k) be another reason?


Another question: how can we get the LD score regression line, and how can we get rid of the outliers in the LD score plot? With the var_exp_info09 reference data, I can get information for 21,892 SNPs (out of 27,518 SNPs).


As always, thank you,

Kevin

Brendan

unread,
Jul 3, 2015, 6:06:42 AM7/3/15
to ldsc_...@googlegroups.com, kevin....@gmail.com
Hi Kevin,

(chi^2 - 1) / N is basically proportional to what you get out of constrained-intercept LD Score regression (a more or less equivalent analysis would have been to run constrained-intercept LD Score regression), so if there is no pop strat in either dataset (or comparable pop strat in both datasets), then mean[(chi^2 - 1) / N] should be equal for the same set of SNPs in both datasets. Sorry, could you remind me whether either dataset has GC correction? The two are not directly comparable if there is GC correction. 
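
A toy numerical check of that proportionality claim (all numbers here are simulated): under the LD Score regression model with intercept 1 and no confounding, E[chi^2] = 1 + N*h2*l_j/M, so mean[(chi^2 - 1)/N] equals h2 * mean(l)/M and does not depend on N.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 1_000_000                            # reference panel SNP count (toy)
h2 = 0.25
l = rng.gamma(2.0, 50.0, size=10_000)    # LD Scores of the regression SNPs

results = {}
for N in (100_000, 500_000):
    chisq = 1.0 + N * h2 * l / M         # model-expected chi^2, intercept 1
    results[N] = ((chisq - 1.0) / N).mean()

print(results)   # the two values agree; both equal h2 * l.mean() / M
```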

re: the LD Score plot -- you'll have to do this manually; we haven't implemented this in ldsc.

re: increased intercepts -- could you tell me the value of the ratio statistic? It's hard to say whether an intercept of 1.2 should be considered large without also knowing the mean chi^2 (if the mean chi^2 is 1.3, then an intercept of 1.2 is huge; if the mean chi^2 is 3, then an intercept of 1.2 is nothing).

Cheers,
Brendan

kevin....@gmail.com

unread,
Aug 5, 2015, 12:14:47 PM8/5/15
to ldsc_users, kevin....@gmail.com

Hi Brendan,

Sorry for being so late to answer your questions (completely distracted recently by other unrelated tasks).

The GWAS (GIANT Nature BMI results, PMID: 25673413) is double GC corrected: each individual GWAS cohort was GC corrected (if lambda > 1.0) on input to the GWAS cohort meta-analysis, and each individual Metabochip cohort was GC corrected based on selected SNPs on input to the Metabochip cohort meta-analysis. Then, in the final meta-analysis (combining the GWAS meta-results and the Metabochip meta-results), the GWAS meta-results were GC corrected again on input, and so were the Metabochip meta-results.

I guess that in the final analysis step, the GWAS meta-results should not have been corrected for the lambda value, and the Metabochip meta-results should not have been corrected for the lambda value based on the selected SNPs?

The exome chip results were not GC corrected. PCs and the empirical genetic relationship matrix were adjusted for in the association analysis in each individual cohort. We examined the lambda values for each contributing cohort, and they are within the expected range (around 1).

 

Following your suggestion, I can manually make the LD Score plot and manually delete the outliers. My question now is how to define the slope of the line?

 

HapMap3 results, 15,842 SNPs included: Lambda GC: 2.1742; Mean Chi^2: 3.6791; Intercept: 1.2473 (0.0651); Ratio: 0.0923 (0.0243).

1000G results, 21,892 SNPs included: Lambda GC: 2.1873; Mean Chi^2: 3.6175; Intercept: 1.491 (0.0885); Ratio: 0.1876 (0.0338).

 

We have all our results now, and we will try your method in case a reviewer asks questions along these lines.

Thank you,

Kevin 

kevin....@gmail.com

unread,
Sep 27, 2015, 10:31:20 AM9/27/15
to ldsc_users, kevin....@gmail.com

Dear Brendan,

 

I am so sorry. I have to trouble you again on this question.

 

I can manually make the LD Score plot and manually delete the outliers. My question is still how to define the slope of the line?

 

Thank you,

 

Kevin

Brendan

unread,
Sep 28, 2015, 9:04:17 AM9/28/15
to ldsc_users, kevin....@gmail.com
If you plot chi^2 against LD Score, the slope will be N*h2_hat/M. ldsc prints out h2_hat. To convert h2_hat to N*h2_hat/M, multiply h2_hat by N/M, where N is the sample size specified in your .sumstats file and M is the number in the .M_5_50 file (1176350 if you are using eur_w_ld_chr/).
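
For instance (h2_hat and N below are illustrative numbers, not your actual estimates; M is the eur_w_ld_chr/ .M_5_50 count):

```python
# Expected slope of the chi^2 vs LD Score plot: N * h2_hat / M
h2_hat = 0.25       # Total Observed scale h2 reported by ldsc (hypothetical)
N = 311_000         # sample size from the .sumstats N column (hypothetical)
M = 1_176_350       # from the .M_5_50 files for eur_w_ld_chr/

slope = N * h2_hat / M
print(round(slope, 4))  # prints 0.0661
```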

Best,
Brendan

kevin....@gmail.com

unread,
Nov 1, 2015, 10:22:38 PM11/1/15
to ldsc_users, kevin....@gmail.com
Hi Brendan,

Thanks a lot. Yes, I got the h2 from ldsc (Total Observed scale h2: 0.7921) and also M = 1176350 for eur_w_ld_chr/. I selected N as the maximum sample size in the N column of my .sumstats file. Is this correct?

Best,

Kevin