Is my data suitable for LD score regression method?

jyle...@gmail.com

Nov 9, 2015, 12:56:16 PM

to ldsc_users

My data is from one population and there is some inflation (lambda GC is about 1.2).

Is a single GWAS also suitable for the ldsc method?

I tried the method and the intercept from ldsc is even higher than lambda GC (the intercept was 1.3). So I wonder whether the ldsc method is unsuitable for my data.

Thanks in advance,

Raymond Walters

Nov 9, 2015, 1:04:50 PM

to jyle...@gmail.com, ldsc_users

Hello,

The ldsc method can work with GWAS from a single dataset, though sample size will often be an issue. Can you post your ldsc log files so we can help troubleshoot?

Thanks,

Raymond

-----

Raymond K. Walters

Research Fellow

Analytic & Translational Genetics Unit

Massachusetts General Hospital

rwal...@broadinstitute.org

--
You received this message because you are subscribed to the Google Groups "ldsc_users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ldsc_users+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ldsc_users/9a469f95-b85c-4836-9463-3b951e7e92c2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

jyle...@gmail.com

Nov 9, 2015, 1:17:27 PM

to ldsc_users, jyle...@gmail.com

Hello,

Here are my log files. The first one is for "LD score estimation" and the second one is for "LD score regression".

*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.0
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call:
./ldsc.py \
--ld-wind-cm 1.0 \
--out M_ldsc \
--bfile M_OTK_csr_qc3_chr22 \
--yes-really \
--maf 0.01 \
--l2

Beginning analysis at Mon Nov 9 18:48:33 2015
Read list of 25499 SNPs from M_OTK_csr_qc3_chr22.bim
Read list of 6877 individuals from M_OTK_csr_qc3_chr22.fam
Reading genotypes from M_OTK_csr_qc3_chr22.bed
Estimating LD Score.
Writing LD Scores for 25497 SNPs to M_ldsc.l2.ldscore.gz

Summary of LD Scores in M_ldsc.l2.ldscore.gz
        MAF       L2
mean  0.197   80.434
std   0.144   62.479
min   0.010    1.178
25%   0.066   32.250
50%   0.160   64.316
75%   0.320  110.067
max   0.500  331.227

MAF/LD Score Correlation Matrix
        MAF      L2
MAF   1.000  -0.027
L2   -0.027   1.000

Analysis finished at Mon Nov 9 19:38:15 2015
Total time elapsed: 49.0m:41.91s

---------------------------------------------------------------------

Then, using these LD scores, I tried LD score regression. I created the sumstats file from my GWAS results using munge_sumstats.py.

Thanks for your help!

---------------------------------------------------------------------

*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.0
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call:
./ldsc.py \
--h2 chr22_sumstats.sumstats.gz \
--out chr22_ldsc \
--w-ld M_ldsc \
--ref-ld M_ldsc

Beginning analysis at Tue Nov 10 01:58:52 2015
Reading summary statistics from chr22_sumstats.sumstats.gz ...
Read summary statistics for 21667 SNPs.
Reading reference panel LD Score from M_ldsc ...
Read reference panel LD Scores for 25497 SNPs.
Removing partitioned LD Scores with zero variance.
Reading regression weight LD Score from M_ldsc ...
Read regression weight LD Scores for 25497 SNPs.
After merging with reference panel LD, 21665 SNPs remain.
After merging with regression SNP LD, 21665 SNPs remain.
WARNING: number of SNPs less than 200k; this is almost always bad.
Using two-step estimator with cutoff at 30.
Total Observed scale h2: -0.0057 (0.0041)
Lambda GC: 1.2267
Mean Chi^2: 1.2291
Intercept: 1.3782 (0.0864)
Ratio: 1.6511 (0.377)
Analysis finished at Tue Nov 10 01:58:52 2015
Total time elapsed: 0.15s
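As a reading aid for the last lines of this log: ldsc reports Ratio as (intercept - 1) / (mean chi^2 - 1), that is, the share of the mean chi^2 inflation that the regression attributes to the intercept rather than to polygenic signal. The sketch below just redoes that arithmetic from the rounded values printed in the log; the function name `ldsc_ratio` is made up for illustration and is not part of ldsc.

```shell
# Recompute the ldsc "Ratio" from the intercept and the mean chi^2.
# Usage: ldsc_ratio INTERCEPT MEAN_CHI2
ldsc_ratio() {
    awk -v intercept="$1" -v mean_chi2="$2" \
        'BEGIN { printf "%.4f\n", (intercept - 1) / (mean_chi2 - 1) }'
}

# Values copied from the log above.
ldsc_ratio 1.3782 1.2291   # prints 1.6508 (the log shows 1.6511, computed from unrounded values)
```

A ratio above 1 simply restates that the intercept is larger than the mean chi^2, which goes hand in hand with the negative h2 estimate in this run.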


Raymond Walters

Nov 10, 2015, 11:44:42 AM

to jyle...@gmail.com, ldsc_users

Hello,

Thanks for posting both logs, they’re very helpful!

There are two potential issues to address here. First, when computing the LD scores, the “--yes-really” flag is neither necessary nor recommended (it’s only intended for debugging). It’s probably causing you to pick up long-range LD from population structure, which may explain your higher-than-expected distribution of L2 values (for comparison, see the values in the GitHub tutorial). If you were previously seeing errors suggesting “--yes-really” even when specifying “--ld-wind-cm”, it’s possible your input data are missing cM distances. See “plink --cm-map” for how to fill in this info.
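One quick way to check whether the cM distances are missing is to look at the third column of the .bim file, which in the standard PLINK layout holds the genetic position and is 0 when unset. The following is only a sketch under that assumption; the function name `count_zero_cm` is illustrative.

```shell
# Count variants in a PLINK .bim file whose genetic position (column 3) is 0.
# .bim columns: chromosome, variant ID, genetic position, bp position, allele 1, allele 2.
count_zero_cm() {
    awk '$3 == 0 { n++ }
         END { printf "%d of %d variants have no genetic position\n", n, NR }' "$1"
}

# Example, using the file name from the log above:
# count_zero_cm M_OTK_csr_qc3_chr22.bim
```

If nearly every variant is reported as having no genetic position, a 1 cM window cannot be applied meaningfully until the positions are filled in.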

Second, when estimating h2 it looks like you are using the LD scores from chr 22 only. This isn’t recommended, and is the reason you are getting a warning for the small number of SNPs. If you are interested in h2 specifically from chr22 you might consider using partitioned heritability estimation to split by chromosome, but otherwise you probably want to use all chromosomes. Note you can still estimate the LD scores per chromosome, and just use the “--ref-ld-chr” and “--w-ld-chr” arguments in place of “--ref-ld” and “--w-ld” to supply the set of files.
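A rough sketch of that per-chromosome workflow is below. It only prints the commands so they can be reviewed first. The file prefixes (`M_OTK_csr_qc3_chr`, `M_ldsc_chr`, `all_chr.sumstats.gz`, `all_chr_ldsc`) are hypothetical names patterned on the logs above, and it assumes the genotype data are split into one PLINK fileset per autosome and that ldsc appends the chromosome number to the prefix given to “--ref-ld-chr” and “--w-ld-chr”.

```shell
# Dry run: print one LD score command per autosome, then a single h2 command.
# Review the output, then pipe it to `sh` to actually run it.
print_ldsc_commands() {
    bfile_prefix=$1   # per-chromosome PLINK prefix, chromosome number appended
    ld_prefix=$2      # LD score output prefix, chromosome number appended
    sumstats=$3       # genome-wide munged summary statistics file
    chr=1
    while [ "$chr" -le 22 ]; do
        echo "./ldsc.py --bfile ${bfile_prefix}${chr} --l2 --ld-wind-cm 1.0 --maf 0.01 --out ${ld_prefix}${chr}"
        chr=$((chr + 1))
    done
    echo "./ldsc.py --h2 ${sumstats} --ref-ld-chr ${ld_prefix} --w-ld-chr ${ld_prefix} --out all_chr_ldsc"
}

# Hypothetical file names, patterned on the logs in this thread.
print_ldsc_commands M_OTK_csr_qc3_chr M_ldsc_chr all_chr.sumstats.gz
```

Note that “--yes-really” is left out of the LD score commands, in line with the first point above.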

Hope that helps! Let us know if you hit any other issues.

Cheers,

Raymond


jyle...@gmail.com

Nov 11, 2015, 9:23:56 AM

to ldsc_users, jyle...@gmail.com

Thanks Raymond, your comments are really helpful. I still have several questions.

To define window sizes for LD score estimation, I want to use genetic distance (cM) rather than physical position, but the genetic map files available at UCSC do not include all the markers in my data. This may be because my data are imputed, so they contain more markers than the genetic map files. Is there any way to leave the genetic map positions of these markers as unknown?

Also, the phenotypes of the subjects in the example file "22.fam" (in "1kg_eur.tar.bz2") are all missing (coded as -9), and I'm trying to estimate LD scores for case/control data. Is it fine to use all cases and controls when estimating LD scores, or do I need to run this analysis separately in cases only or controls only?

Thanks again!

Esther

Raymond Walters

Nov 12, 2015, 10:48:03 AM

to jyle...@gmail.com, ldsc_users

Hi Esther,

All good questions. For the genetic distance, if the number of markers not present in the UCSC files is small it probably wouldn’t be harmful to simply omit those markers from LD score estimation (for instance, we’ve seen good performance when restricting to only SNPs present in HapMap3). Alternatively, you may be able to find a map file corresponding to your selected imputation reference. For example, genetic maps for the 1000 Genomes Phase 3 reference are available here.
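If you go the route of omitting markers without a genetic position, one possible way to build the list of markers to keep is sketched below. It assumes the standard .bim layout (variant ID in column 2, genetic position in column 3, 0 when unset); the function name `snps_with_cm` and the `mydata` file names are placeholders.

```shell
# Print the IDs (column 2) of variants whose genetic position (column 3) is set,
# one per line, which is the format `plink --extract` expects.
snps_with_cm() {
    awk '$3 != 0 { print $2 }' "$1"
}

# Example with placeholder file names:
# snps_with_cm mydata.bim > snps_with_cm.txt
# plink --bfile mydata --extract snps_with_cm.txt --make-bed --out mydata_cm
```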

The sample data in 22.fam are from 1000 Genomes, hence the missing phenotype. I don’t think anyone’s looked closely at estimating LD scores from case/control data, but the key consideration is that you want the LD scores to be estimated from a sample that reflects the population LD structure. Depending on the prevalence of your phenotype and any ascertainment in your sampling design, it’s possible you may want to estimate LD scores from controls only to avoid ascertainment-induced LD among risk variants.

Alternatively, given that you’re already imputing your data it may be appropriate (and easier) to simply compute LD scores from that reference population rather than in your study sample.
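For the controls-only option mentioned above, a minimal sketch of pulling the controls out of a .fam file follows. It assumes the standard PLINK coding in column 6 (1 = control, 2 = case, -9 or 0 = missing); the function name `controls_keep_list` and the `mydata` file names are placeholders.

```shell
# Print "FID IID" for every control (phenotype column 6 equal to 1),
# which is the format `plink --keep` expects.
controls_keep_list() {
    awk '$6 == 1 { print $1, $2 }' "$1"
}

# Example with placeholder file names:
# controls_keep_list mydata.fam > controls.txt
# plink --bfile mydata --keep controls.txt --make-bed --out mydata_controls
```

The resulting controls-only fileset can then be passed to ldsc with “--bfile” for LD score estimation.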

Cheers,

Raymond
