Dear all,
I am new to LD score regression method. Sorry for my naive questions. We are now meta-analyzing the exome-chip data in a big sample size. We noticed there is a huge inflation of the lambda value (ƛ = 2) for the common variants (MAF > 0.05), but the lambda value is acceptable (ƛ = 1.04) for the low-frequency variants and rare variants (MAF < 0.05). We want to check whether the huge inflation of the lambda values in common variants is due to the polygenic model with increased sample size.
I want to use the phase3 1000 Genome EUR data. Originally, I want to extract these common variants in our results from the 1000 Genome dataset and calculate the LD scores based on the 1000G .bed/.bim/.fam files, and then regress the LD scores against the Chi-square value of each common variants to get the LD regression intercept. However, after I read something like “the variants included in LD score regression are a subset of the variants included in LD score estimation”. I am a little puzzled of how to proceed?
Another question is how to make the figure 2 or supplementary figure 8a-w in that paper? How to get the regression weight? I imagine that I can bin the LD score values into the bins of -50, 50-100, 100-150, 150-200, 200-250, 250- and I could also get the mean Chi-square values for the variants in each bin? How to get the regression weight?
Thanks for your help and time.
Kevin
python ldsc.py
--h2 scz.sumstats.gz
--ref-ld-chr eur_w_ld_chr/
--w-ld-chr eur_w_ld_chr/
--out scz
re: the figures -- currently there is no way to produce these figures automatically in ldsc, but it isn't hard to do. Just merge the LD Scores and the .sumstats files, bin the SNPs into 50 LD Score quantiles (e.g., using the cut2 function in the Hmisc R package, or qcut in python pandas), and for each quantile bin, plot a dot where the x-coordinate is the mean LD Score for SNPs in that quantile bin and the y-coordinate is the mean chi^2 for SNPs in that quantile bin.
Let me know how the results look; happy to help you troubleshoot.
Hi Brendan,
Thanks a lot for your help and suggestion. I did the following steps:
Options:
--out BMI_sum
--merge-alleles w_hm3.snplist
--sumstats GIANTMETA_BMI_ADD_ALL_common.txt
Interpreting column names as follows:
N: Sample size
a1: Allele 1, interpreted as ref allele for signed sumstat.
beta: [linear/logistic] regression coefficient (0 --> no effect; above 0 --> A1 is trait/risk increasing)
a2: Allele 2, interpreted as non-ref allele for signed sumstat.
SNP: Variant ID (e.g., rs number)
Pvalue: p-Value
Reading list of SNPs for allele merge from w_hm3.snplist
Read 1217311 SNPs for allele merge.
Reading sumstats from GIANTMETA_BMI_ADD_ALL_common.txt into memory 5000000.0 SNPs at a time.
Read 27559 SNPs from --sumstats file.
Removed 11218 SNPs not in --merge-alleles.
Removed 213 SNPs with missing values.
Removed 0 SNPs with INFO <= 0.9.
Removed 0 SNPs with MAF <= 0.01.
Removed 0 SNPs with out-of-bounds p-values.
Removed 0 variants that were not SNPs or were strand-ambiguous.
16128 SNPs remain.
Removed 0 SNPs with duplicated rs numbers (16128 SNPs remain).
Removed 538 SNPs with N < 311121.333333 (15590 SNPs remain).
Median value of beta was 7.35e-05, which seems sensible.
Removed 0 SNPs whose alleles did not match --merge-alleles (15590 SNPs remain).
Writing summary statistics for 1217311 SNPs (15590 with nonmissing beta) to BMI_sum.sumstats.gz.
Metadata:
Mean chi^2 = 3.461
Lambda GC = 2.011
Max chi^2 = 1096.911
171 Genome-wide significant SNPs (some may have been removed by filtering).
Conversion finished at Wed Apr 29 18:12:54 2015
Total time elapsed: 9.46s
Options:
--h2 BMI.sumstats.gz
--ref-ld-chr eur_w_ld_chr/
--out BMI
--w-ld-chr eur_w_ld_chr/
Beginning analysis at Wed Apr 29 14:13:20 2015
Reading summary statistics from BMI.sumstats.gz ...
Read summary statistics for 15590 SNPs.
Reading reference panel LD Score from eur_w_ld_chr/[1-22] ...
Read reference panel LD Scores for 1293150 SNPs.
Removing partitioned LD Scores with zero variance.
Reading regression weight LD Score from eur_w_ld_chr/[1-22] ...
Read regression weight LD Scores for 1293150 SNPs.
After merging with reference panel LD, 15589 SNPs remain.
After merging with regression SNP LD, 15589 SNPs remain.
WARNING: number of SNPs less than 200k; this is almost always bad.
Using two-step estimator with cutoff at 30.
Total Observed scale h2: 0.2499 (0.0611)
Lambda GC: 2.0091
Mean Chi^2: 3.4606
Intercept: 1.1704 (0.0649)
Ratio: 0.0693 (0.0264)
Analysis finished at Wed Apr 29 14:13:34 2015
Total time elapsed: 13.73s
Here is the plot (attachment). I still have not got the regression weight? As you see, there are so many SNPs are not included in the analysis (11218 of 27559 SNPs are removed). What is the difference between Intercept from h2 step [Intercept: 1.1704 (0.0649)] and the intercept from the figure (2.1688)? I plot the means of L2 and chi-square in each of the 50 bins in excel and then use the linear trend to get the line. I can imagine that there will be an intercept based on this line (2.1688)?
Perhaps, I should use the phase3 1000G data to re-calculate the LD score to include more variants instead of using the files in the directory of eur_w_ld_chr?
Thanks for your suggestions.
Kevin
I tried the new full 1kG LD scores. Unexpected that the intercept is increased (Intercept: 1.3197 (0.0712)) compared with the previous results (Intercept: 1.1704 (0.0649), based on SNPs with INFO > 0.9 in the SCZ GWAS).
We have more SNPs included in the analysis with the full 1kG score (18689 SNPs vs 15589 SNPs).
This might be due to ‘normal’ fluctuation in the estimate?
Thank you,
Kevin
Options:
--h2 BMI_0430.sumstats.gz
--ref-ld-chr var_exp_info09/
--out BMI_0430_LD
--w-ld-chr var_exp_info09/
Beginning analysis at Thu Apr 30 13:51:25 2015
Reading summary statistics from BMI_0430.sumstats.gz ...
Read summary statistics for 23373 SNPs.
Reading reference panel LD Score from var_exp_info09/[1-22] ...
Read reference panel LD Scores for 5886653 SNPs.
Removing partitioned LD Scores with zero variance.
Reading regression weight LD Score from var_exp_info09/[1-22] ...
Read regression weight LD Scores for 5886653 SNPs.
After merging with reference panel LD, 18689 SNPs remain.
After merging with regression SNP LD, 18689 SNPs remain.
WARNING: number of SNPs less than 200k; this is almost always bad.
Using two-step estimator with cutoff at 30.
Total Observed scale h2: 0.2251 (0.0516)
Lambda GC: 2.0641
Mean Chi^2: 3.4134
Intercept: 1.3197 (0.0712)
Ratio: 0.1325 (0.0295)
Analysis finished at Thu Apr 30 13:52:28 2015
Total time elapsed: 1.0m:3.23s
Options:
--h2 BMI_0430.sumstats.gz
--ref-ld-chr eur_w_ld_chr/
--out BMI_0512
--w-ld-chr eur_w_ld_chr/
Beginning analysis at Tue May 12 16:20:39 2015
Reading summary statistics from BMI_0430.sumstats.gz ...
Read summary statistics for 23373 SNPs.
Reading reference panel LD Score from eur_w_ld_chr/[1-22] ...
Read reference panel LD Scores for 1293150 SNPs.
Removing partitioned LD Scores with zero variance.
Reading regression weight LD Score from eur_w_ld_chr/[1-22] ...
Read regression weight LD Scores for 1293150 SNPs.
After merging with reference panel LD, 15589 SNPs remain.
After merging with regression SNP LD, 15589 SNPs remain.
WARNING: number of SNPs less than 200k; this is almost always bad.
Using two-step estimator with cutoff at 30.
Total Observed scale h2: 0.2499 (0.0611)
Lambda GC: 2.0091
Mean Chi^2: 3.4606
Intercept: 1.1704 (0.0649)
Ratio: 0.0693 (0.0264)
Analysis finished at Tue May 12 16:20:53 2015
Total time elapsed: 13.33s
Thank you, Brendan.
We are going to do the last round analysis. I will use HM3 LD scores (in the directory of eur_w_ld) for our late analysis. I will keep you informed.
Kevin
Hi Brendan,
I compared the mean [(chi^2-1)/n] for 22,595 SNPs between our Exome chip array results and latest Giant nature BMI results (PMID:25673413). There is no big difference. Could you explain more (the rationale) on doing this comparison? I got the chi^2 from the squared (beta/se).
Exome chip mean [(chi^2-1)/n] = 6.14E-06
|
Basic Statistical Measures |
|||
|
Location |
Variability |
||
|
Mean |
6.14E-06 |
Std Deviation |
5.46E-05 |
|
Median |
-9.07E-08 |
Variance |
2.98E-09 |
|
Mode |
. |
Range |
0.00264 |
|
|
|
Interquartile Range |
6.69E-06 |
GWAS mean [(chi^2-1)/n] = 2.00E-06
|
Basic Statistical Measures |
|||
|
Location |
Variability |
||
|
Mean |
2.00E-06 |
Std Deviation |
6.34E-05 |
|
Median |
-1.97E-06 |
Variance |
4.02E-09 |
|
Mode |
0 |
Range |
0.00317 |
|
|
|
Interquartile Range |
6.93E-06 |
The intercept is 1.25 (0.06) for 15,842 SNPs using HapMap3 (eur_w_ld_chr) reference file, and is 1.49 (0.09) for 21,892 SNPs using 1000G (var_exp_info09) files. Exomechip common SNPs are heavily enriched with known GWAS hits (non-ramdomly selected). Could this explain the increased intercept? or maybe, the small number of SNPs (<< the requrired 200kb) included in the analysis could be another reason?
Another question, how could we get the LD score regression line and how to get rid of these outliers in the LD score plot? With the var_exp_info09 reference data, I can get the information from 21,892 SNPs (out of 27,518 SNPs).
As always, thank you,
Kevin
Hi Brendan,
Sorry for being so late to answer your questions (completely distracted recently by other unrelated tasks).
The GWAS (Giant nature BMI results (PMID:25673413)) is double-GC corrected : Individual GWAS cohort was GC corrected (if lambda > 1.0) during the input for meta-analyzing the GWAS cohort data; individual Metabochip cohort was GC corrected based on selected SNPs during the input for meta-analyzing Metabochip cohort data;In the final meta-analysis (meta-analyzing the GWAS meta-results and Metabochip meta-results), the GWAS meta-results was GC corrected during the input, and the Metabochip meta-results was GC corrected during the input.
I guess in the final analysis step, the GWAS meta-results should not be corrected for the lambda value, and also the Metabochip meta-results should not be corrected for the lambda value based on the selected SNPs?
The Exomechip results were not GC corrected. PCs and empirical genetic relationship were adjusted during the association analysis in each individual cohort. We examined the lambda values for each contributing cohort, and the values are within expected range (around 1).
Upon your suggestion, I can manually make the LD Score plot. I can also manually delete the outliers. A question for me now is how I could define the lope for the line?
Hapmap3 results, 15,842 SNPs included: Lambda GC:2.1742; Mean Chi^2: 3.6791; Intercept:1.2473 (0.0651); Ratio:0.0923 (0.0243);
1000G results, 21,892 SNPs included: Lambda GC:2.1873; Mean Chi^3: 3.6175; Intercept:1.491 (0.0885); Ratio:0.1876(0.0338);
We have got all our results now, and try your method in case the reviewer will ask us some questions along this line.
Thank you,
Kevin
Dear Brendan,
I am so sorry. I have to trouble you again on this question.
I can manually make the LD Score plot. I can also manually delete the outliers. A question for me now is how I could define the slope for the line?
Thank you,
Kevin