LD score for Chinese population

wong jane

May 17, 2016, 12:39:27 AM

to ldsc_users

Hi,

Regarding the East Asian LD Scores from 1000 Genomes provided on the website, I'd like to ask which 1000G subpopulations were used to estimate them. Is this an adequate reference panel for analyses based on the Chinese population? In addition, for estimating total heritability, can these data also be used for weighting the regression (--w-ld)? I'm not very clear on how the weighting LD scores are used; could anyone give a brief introduction? Many thanks.

Best regards,

Jane

Raymond Walters

May 17, 2016, 1:19:49 PM

to wong jane, ldsc_users

Hi Jane,

The East Asian LD Scores were computed from the 1000 Genomes phase 3 EAS samples, which include samples from 5 population cohorts:

- Han Chinese in Beijing, China (CHB)

- Japanese in Tokyo, Japan (JPT)

- Southern Han Chinese (CHS)

- Chinese Dai in Xishuangbanna, China (CDX)

- Kinh in Ho Chi Minh City, Vietnam (KHV)

I’m not an expert on LD structure in East Asian populations, but I would generally expect this to be a more than adequate reference panel for the Chinese population.

The provided files are appropriate for the --w-ld argument. The weights are used to address the fact that, in the LD score model, the variance of the GWAS summary statistics (as well as the mean) is related to the LD score. To adjust for the resulting heteroscedasticity, LD score regression uses a weighted regression. Details are in the initial LD score paper's Methods and Supplementary Note.
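As a rough illustration of that weighting, the Python sketch below fits chi-square on LD score by weighted least squares. It assumes the simplified variance form from the LD score model, Var[chi^2 | l] proportional to (1 + N*h2*l/M)^2, plus a 1/l factor to down-weight SNPs that are overcounted through LD. The function names, the toy data, the floor of 1 on LD scores, and the starting h2 guess are all assumptions made for the example; ldsc itself does more than this (for instance block-jackknife standard errors and the two-step estimator).

```python
def ldsc_weights(ref_ld, w_ld, n, m, h2_guess):
    """Approximate regression weights: 1 / (overcounting * heteroscedasticity)."""
    weights = []
    for l_ref, l_w in zip(ref_ld, w_ld):
        l_ref = max(l_ref, 1.0)  # assumption: floor LD scores at 1 to avoid huge weights
        l_w = max(l_w, 1.0)
        het = (1.0 + n * h2_guess * l_ref / m) ** 2  # variance of chi^2 grows with LD score
        weights.append(1.0 / (l_w * het))
    return weights


def weighted_fit(x, y, w):
    """Weighted least squares for y = intercept + slope * x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Toy data (made up): chi^2 lies exactly on the line 1 + N*h2*l/M, with no noise.
N, M, H2 = 5700, 1_000_000, 0.3
ld = [5.0, 20.0, 60.0, 150.0, 300.0]
chi2 = [1.0 + N * H2 * l / M for l in ld]

# The same LD scores play both roles here: predictor (ref-ld) and weights (w-ld).
w = ldsc_weights(ld, ld, N, M, h2_guess=0.3)
intercept, slope = weighted_fit(ld, chi2, w)
h2_est = slope * M / N  # the slope is N*h2/M, so rescale to get h2
print(intercept, h2_est)
```

High-LD SNPs receive the smallest weights, both because their chi-square statistics are noisier and because their signal is shared with many neighbouring SNPs.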

The current recommendation is that it’s sufficient to use the same file for --ref-ld (the actual values used in the regression) and --w-ld (for weighting the regression) for unpartitioned heritability and genetic correlation analysis. Different LD score files are needed for partitioned heritability analysis (--ref-ld scores should be partitioned, --w-ld scores should not).
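For the unpartitioned case, that recommendation amounts to passing one LD score prefix to both arguments. The small Python helper below only assembles such a command line; the helper name is hypothetical, the paths are placeholders, and the per-chromosome flags (--ref-ld-chr, --w-ld-chr) assume LD scores split by chromosome.

```python
def ldsc_h2_command(sumstats, ld_prefix, out_prefix):
    """Build an ldsc.py call that uses one LD score set for both roles."""
    return [
        "./ldsc.py",
        "--h2", sumstats,
        "--ref-ld-chr", ld_prefix,  # LD scores used as the regression predictor
        "--w-ld-chr", ld_prefix,    # the same scores, used for the regression weights
        "--out", out_prefix,
    ]


# Placeholder paths; substitute your own files.
cmd = ldsc_h2_command(
    "result/chd_nonadj.sumstats.gz",
    "ref_data/eas_ldscores/",
    "result/chd_nonadj.h2",
)
print(" ".join(cmd))
```

The list can be handed to subprocess.run(cmd, check=True) once the paths point at real files.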

Cheers,

Raymond

--
You received this message because you are subscribed to the Google Groups "ldsc_users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ldsc_users+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ldsc_users/4bb2f2ad-1f62-429d-8910-1c7aee0b8803%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

wong jane

May 18, 2016, 12:07:15 AM

to ldsc_users, jane.w...@gmail.com

Hi Raymond,

Thanks for your quick reply, it's very helpful. I have some further questions about the use of the software.

1) It's recommended to use the HapMap3 SNP list for LD score regression, but if the summary statistics are derived from genotyped rather than imputed data, and the genotyping quality is fine, then we don't need to set the --merge-alleles flag to filter the SNPs, right? A large proportion of my SNPs are not in the hm3 list, and I hope to use all the SNPs in my data. Furthermore, if we have imputation quality information, we also don't need the hm3 list, right?

2) If the mean chi-square is too small (<1.02), the data are considered unsuitable for the regression analysis, but what is the reason? Is it due to small sample size, low heritability of the phenotype, or something else? The sample size of my data is ~5700. Please see the log files below.

3) Can we calculate a P-value for the total observed h2 under the null hypothesis h2=0?

Best,

Jane

*********************************************************************

* LD Score Regression (LDSC)

* Version 1.0.0

* Broad Institute of MIT and Harvard / MIT Department of Mathematics

* GNU General Public License v3

*********************************************************************

Call:

./munge_sumstats.py \

--out result/chd_nonadj \

--sumstats data/chd_nonadj.assoc

Interpreting column names as follows:

info: INFO score (imputation quality; higher --> better imputation)

snpid: Variant ID (e.g., rs number)

N: Sample size

a1: Allele 1, interpreted as ref allele for signed sumstat.

pval: p-Value

a2: Allele 2, interpreted as non-ref allele for signed sumstat.

or: Odds ratio (1 --> no effect; above 1 --> A1 is risk increasing)

Reading sumstats from data/chd_nonadj.assoc into memory 5000000 SNPs at a time.

Read 1257031 SNPs from --sumstats file.

Removed 0 SNPs with missing values.

Removed 0 SNPs with INFO <= 0.9.

Removed 0 SNPs with MAF <= 0.01.

Removed 0 SNPs with out-of-bounds p-values.

Removed 43338 variants that were not SNPs or were strand-ambiguous.

1213693 SNPs remain.

Removed 0 SNPs with duplicated rs numbers (1213693 SNPs remain).

Removed 0 SNPs with N < 3828.0 (1213693 SNPs remain).

Median value of or was 0.9998, which seems sensible.

Writing summary statistics for 1213693 SNPs (1213693 with nonmissing beta) to result/chd_nonadj.sumstats.gz.

Metadata:

Mean chi^2 = 0.995

WARNING: mean chi^2 may be too small.

Lambda GC = 0.992

Max chi^2 = 22.907

0 Genome-wide significant SNPs (some may have been removed by filtering).

Conversion finished at Wed May 18 10:53:15 2016

Total time elapsed: 18.09s

*********************************************************************

* LD Score Regression (LDSC)

* Version 1.0.0

* Broad Institute of MIT and Harvard / MIT Department of Mathematics

* GNU General Public License v3

*********************************************************************

Call:

./ldsc.py \

--h2 result/chd_nonadj.sumstats.gz \

--ref-ld-chr ref_data/eas_ldscores/ \

--out result/chd_nonadj.h2 \

--w-ld-chr ref_data/eas_ldscores/

Beginning analysis at Wed May 18 10:54:35 2016

Reading summary statistics from result/chd_nonadj.sumstats.gz ...

Read summary statistics for 1213693 SNPs.

Reading reference panel LD Score from ref_data/eas_ldscores/[1-22] ...

Read reference panel LD Scores for 1208050 SNPs.

Removing partitioned LD Scores with zero variance.

Reading regression weight LD Score from ref_data/eas_ldscores/[1-22] ...

Read regression weight LD Scores for 1208050 SNPs.

After merging with reference panel LD, 579457 SNPs remain.

After merging with regression SNP LD, 579457 SNPs remain.

Using two-step estimator with cutoff at 30.

Total Observed scale h2: 0.0023 (0.084)

Lambda GC: 0.9927

Mean Chi^2: 0.9951

Intercept: 0.9948 (0.0064)

Ratio: NA (mean chi^2 < 1)

Analysis finished at Wed May 18 10:54:44 2016

Total time elapsed: 9.41s

On Wednesday, May 18, 2016 at 1:19:49 AM UTC+8, Raymond Walters wrote:

Raymond Walters

May 18, 2016, 2:43:53 PM

to wong jane, ldsc_users

Hi Jane,

1) The benefit of --merge-alleles with HapMap3 SNPs is largely for pre-checking that the alleles in the GWAS results match the reference and for consistency in genetic correlation analyses comparing to published summary statistics from outside sources. If those aren’t issues for your analysis then you should be fine without it (though still using --merge-alleles / HM3 wouldn’t do much harm either). It just leaves you with more responsibility for making sure you’re filtering to a sufficiently clean and representative set of SNPs for analysis.

2) Small mean chi-square tends to indicate lack of polygenic signal. Low power (small N), low heritability, insufficient filtering of input SNPs, or application of genomic control can all contribute to deflated mean chi-square. In these cases, there likely won’t be enough signal to fit in LD score regression to get informative results (though there is no harm in running the regression to verify). Anecdotally, results such as your log file with reasonable sample size but mean chi-square < 1 tend to reflect either use of genomic control or insufficient filtering on MAF.

3) Yes, p-values can be computed by dividing the h2 estimate by its standard error and treating the result as a z-statistic.
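Points 2 and 3 can be made concrete with a small Python sketch. It assumes per-SNP z-scores are available in memory; the function names and toy inputs are invented for illustration and are not part of ldsc. The expected mean chi-square uses the basic LD score model without confounding, E[chi^2] = 1 + N*h2*mean(l)/M, and the one-sided p-value is an added assumption (that the alternative is h2 > 0); the two-sided value is the plain z-test.

```python
import math
import statistics

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df


def chi2_summary(z_scores):
    """Mean chi-square and lambda GC from per-SNP z-scores."""
    chi2 = [z * z for z in z_scores]
    mean_chi2 = sum(chi2) / len(chi2)
    lambda_gc = statistics.median(chi2) / CHI2_1DF_MEDIAN
    return mean_chi2, lambda_gc


def expected_mean_chi2(n, h2, mean_ld, m):
    """Expected mean chi-square under the LD score model with no confounding."""
    return 1.0 + n * h2 * mean_ld / m


def h2_z_test(h2, se):
    """z-statistic and normal-theory p-values for H0: h2 = 0."""
    z = h2 / se
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2.0))  # assumption: H1 is h2 > 0
    return z, p_two_sided, p_one_sided


# Toy z-scores, only to show the calculation.
mean_chi2, lambda_gc = chi2_summary([-1.5, -0.8, -0.3, 0.1, 0.6, 0.9, 2.1])

# Hypothetical trait: the sample size matches this thread, the other values are made up.
expected = expected_mean_chi2(n=5700, h2=0.3, mean_ld=100.0, m=1_000_000)

# Estimate and standard error taken from the log above: 0.0023 (0.084).
z, p_two_sided, p_one_sided = h2_z_test(0.0023, 0.084)
print(f"z = {z:.3f}, two-sided p = {p_two_sided:.3f}, one-sided p = {p_one_sided:.3f}")
```

The expected-value function shows why small N or low h2 keeps the mean chi-square close to 1: the excess over 1 scales with the product N*h2. For the estimate in the log, the standard error is far larger than the estimate, so z is close to zero and the test is nowhere near significant.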

Cheers,

Raymond

wong jane

May 19, 2016, 6:13:42 AM

to ldsc_users, jane.w...@gmail.com

Hi Raymond,

Many thanks. First, I'd like to know a little more about the HM3 SNP list. From a previous post, I understand that this list is not European-specific and is also suitable for the EAS population. How was this list obtained? These SNPs are well-imputed common SNPs from HapMap 3 data, but does "well-imputed" refer to European populations, Asian populations, or all populations? MAF and imputation quality may differ significantly between populations. Furthermore, I guess the rs numbers in this list are based on build 36 of HM3; are the rs numbers in the 1000G reference panel also based on build 36? I ask because the software seems to keep only the SNPs that overlap between the 1000G and regression SNPs, while the rs number of a SNP may be updated in a newer build. In my analysis, there were originally ~1.25 million SNPs with MAF >1% (1.09 million with MAF >5%), but only 0.57 million (46%) overlapped with the HM3 list. The rs numbers in my data are based on build 37; I'm not sure whether this is a major issue.

Second, as you mentioned in item 2, the small mean chi-square in my analysis may be due to insufficient filtering on MAF (I didn't use genomic control). Does this mean I should filter the SNPs with a higher cut-off (e.g. >5%)? All SNPs in my data have MAF >1% after standard quality control.