Is my data suitable for LD score regression method?


jyle...@gmail.com

Nov 9, 2015, 12:56:16 PM
to ldsc_users
My data is from one population and there's some inflation (lambda GC is about 1.2).
Is the ldsc method also suitable for a single GWAS?
I tried this method and the intercept from ldsc is even higher than lambda GC (the intercept was 1.3). So I wonder whether the ldsc method is unsuitable for my data.

Thanks in advance,

Raymond Walters

Nov 9, 2015, 1:04:50 PM
to jyle...@gmail.com, ldsc_users
Hello,
The ldsc method can work with GWAS from a single dataset, though sample size will often be an issue. Can you post your ldsc log files so we can help troubleshoot?
Thanks,
Raymond


-----
Raymond K. Walters
Research Fellow
Analytic & Translational Genetics Unit
Massachusetts General Hospital




jyle...@gmail.com

Nov 9, 2015, 1:17:27 PM
to ldsc_users, jyle...@gmail.com
Hello,

Here are my log files. The first one is for "LD score estimation" and the second one is for "LD score regression".

*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.0
* (C) 2014-2015 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call: 
./ldsc.py \
--ld-wind-cm 1.0 \
--out M_ldsc \
--bfile M_OTK_csr_qc3_chr22 \
--yes-really  \
--maf 0.01 \
--l2  

Beginning analysis at Mon Nov  9 18:48:33 2015
Read list of 25499 SNPs from M_OTK_csr_qc3_chr22.bim
Read list of 6877 individuals from M_OTK_csr_qc3_chr22.fam
Reading genotypes from M_OTK_csr_qc3_chr22.bed
Estimating LD Score.
Writing LD Scores for 25497 SNPs to M_ldsc.l2.ldscore.gz

Summary of LD Scores in M_ldsc.l2.ldscore.gz
        MAF       L2
mean  0.197   80.434
std   0.144   62.479
min   0.010    1.178
25%   0.066   32.250
50%   0.160   64.316
75%   0.320  110.067
max   0.500  331.227

MAF/LD Score Correlation Matrix
       MAF     L2
MAF  1.000 -0.027
L2  -0.027  1.000
Analysis finished at Mon Nov  9 19:38:15 2015
Total time elapsed: 49.0m:41.91s

---------------------------------------------------------------------

Then, using these LD scores, I tried LD score regression. I created the sumstats file from my GWAS results using munge_sumstats.py.
Thanks for your help!

---------------------------------------------------------------------

*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.0
* (C) 2014-2015 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call: 
./ldsc.py \
--h2 chr22_sumstats.sumstats.gz \
--out chr22_ldsc \
--w-ld M_ldsc \
--ref-ld M_ldsc 

Beginning analysis at Tue Nov 10 01:58:52 2015
Reading summary statistics from chr22_sumstats.sumstats.gz ...
Read summary statistics for 21667 SNPs.
Reading reference panel LD Score from M_ldsc ...
Read reference panel LD Scores for 25497 SNPs.
Removing partitioned LD Scores with zero variance.
Reading regression weight LD Score from M_ldsc ...
Read regression weight LD Scores for 25497 SNPs.
After merging with reference panel LD, 21665 SNPs remain.
After merging with regression SNP LD, 21665 SNPs remain.
WARNING: number of SNPs less than 200k; this is almost always bad.
Using two-step estimator with cutoff at 30.
Total Observed scale h2: -0.0057 (0.0041)
Lambda GC: 1.2267
Mean Chi^2: 1.2291
Intercept: 1.3782 (0.0864)
Ratio: 1.6511 (0.377)
Analysis finished at Tue Nov 10 01:58:52 2015
Total time elapsed: 0.15s





Raymond Walters

Nov 10, 2015, 11:44:42 AM
to jyle...@gmail.com, ldsc_users
Hello,
Thanks for posting both logs, they’re very helpful!

There are two potential issues to address here. First, when computing the LD scores, the “--yes-really” flag is neither necessary nor recommended (it’s only intended for debugging). It’s probably causing you to pick up long-range LD from population structure, which may explain your higher-than-expected distribution of L2 values (for comparison, see the values in the GitHub tutorial). If you were previously seeing errors suggesting “--yes-really” even when specifying “--ld-wind-cm”, it’s possible your input data is missing cM distances. See “plink --cm-map” for how to fill in this info.
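
A quick way to check for missing cM distances is to look at the third column of the PLINK .bim file, which holds the genetic distance and is 0 when it has not been set. The following is only a minimal sketch: the function name `count_missing_cm` and the three demo rows are made up for illustration, and a SNP that genuinely sits at 0 cM would be counted as missing too.

```python
def count_missing_cm(bim_lines):
    """Return (n_missing, n_total) for the cM column of .bim-formatted lines."""
    n_missing = 0
    n_total = 0
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        n_total += 1
        # .bim columns: chromosome, SNP ID, cM, bp, allele 1, allele 2
        if float(fields[2]) == 0.0:
            n_missing += 1
    return n_missing, n_total


# Made-up rows (not real data) just to show the call.
demo_bim = [
    "22 rs1 0 16050000 A G",
    "22 rs2 0.0132 16051000 C T",
    "22 rs3 0 16052000 G A",
]
missing, total = count_missing_cm(demo_bim)
print(f"{missing} of {total} SNPs have a cM value of 0")

# Real use, with the file name from the log in this thread:
# with open("M_OTK_csr_qc3_chr22.bim") as handle:
#     print(count_missing_cm(handle))
```

If nearly every SNP comes back as missing, a window given in cM cannot be applied meaningfully until the distances are filled in.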

Second, when estimating h2 it looks like you are using the LD scores from chr 22 only. This isn’t recommended, and is the reason you are getting a warning for the small number of SNPs. If you are interested in h2 specifically from chr22 you might consider using partitioned heritability estimation to split by chromosome, but otherwise you probably want to use all chromosomes. Note you can still estimate the LD scores per chromosome, and just use the “--ref-ld-chr” and “--w-ld-chr” arguments in place of “--ref-ld” and “--w-ld” to supply the set of files. 
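
The per-chromosome workflow described above could be scripted roughly as follows. This is a sketch under assumptions: the prefixes `mydata_chr`, `ldscores`, `all_chr.sumstats.gz` and `all_chr_ldsc` are hypothetical, the helper names are invented for illustration, and it relies on ldsc appending the chromosome number to the prefix given to “--ref-ld-chr” and “--w-ld-chr” (so `ldscores/` would resolve to `ldscores/1.l2.ldscore.gz`, `ldscores/2.l2.ldscore.gz`, and so on). It only builds and prints the commands; nothing is executed.

```python
import shlex


def ldscore_commands(bfile_prefix, out_dir, chromosomes=range(1, 23)):
    """Build one LD score estimation command per chromosome (no --yes-really)."""
    commands = []
    for chrom in chromosomes:
        commands.append([
            "./ldsc.py",
            "--bfile", f"{bfile_prefix}{chrom}",
            "--l2",
            "--ld-wind-cm", "1.0",
            "--maf", "0.01",
            "--out", f"{out_dir}/{chrom}",
        ])
    return commands


def h2_command(sumstats, ld_dir, out_prefix):
    """Build the h2 command that reads the per-chromosome LD score files."""
    return [
        "./ldsc.py",
        "--h2", sumstats,
        "--ref-ld-chr", f"{ld_dir}/",
        "--w-ld-chr", f"{ld_dir}/",
        "--out", out_prefix,
    ]


# Print two of the 22 estimation commands, then the regression command.
for cmd in ldscore_commands("mydata_chr", "ldscores", chromosomes=[21, 22]):
    print(shlex.join(cmd))
print(shlex.join(h2_command("all_chr.sumstats.gz", "ldscores", "all_chr_ldsc")))
```

Each printed line can be pasted into a shell, or the argument lists can be handed to `subprocess.run` once the paths match your own files.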

Hope that helps! Let us know if you hit any other issues.

Cheers,
Raymond




jyle...@gmail.com

Nov 11, 2015, 9:23:56 AM
to ldsc_users, jyle...@gmail.com
Thanks Raymond, your comments are really helpful. I still have several questions.

To define window sizes for LD score estimation, I want to use genetic distance (cM) rather than physical position, but the genetic map files available at UCSC do not include all the markers in my data. Maybe that's because I'm using imputed data, which has more markers than the genetic map files do. Is there any way to leave the genetic map positions of these markers as unknown?

Also, the phenotypes of the subjects in the example data file "22.fam" (in "1kg_eur.tar.bz2") are all missing (coded as -9). I'm trying to estimate LD scores for case/control data. Is it fine to use all cases and controls when estimating LD scores, or do I need to run this analysis separately for cases only or controls only?

Thanks again!
Esther

Raymond Walters

Nov 12, 2015, 10:48:03 AM
to jyle...@gmail.com, ldsc_users
Hi Esther,
All good questions. For the genetic distance, if the number of markers not present in the UCSC files is small it probably wouldn’t be harmful to simply omit those markers from LD score estimation (for instance, we’ve seen good performance when restricting to only SNPs present in HapMap3). Alternatively, you may be able to find a map file corresponding to your selected imputation reference. For example, genetic maps for the 1000 Genomes Phase 3 reference are available here
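
One way to omit the unmapped markers is to list the SNPs that did receive a genetic position and keep only those. The sketch below assumes the cM column of the .bim has already been filled in where possible and that 0 marks an unmapped SNP; the function name `snp_ids_with_cm` and the demo rows are illustrative only.

```python
def snp_ids_with_cm(bim_lines):
    """Return the IDs of SNPs whose cM column (third field) is nonzero."""
    keep = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) >= 6 and float(fields[2]) != 0.0:
            keep.append(fields[1])
    return keep


# Made-up rows (not real data): rs1 and rs3 have no genetic position.
demo_bim = [
    "22 rs1 0 16050000 A G",
    "22 rs2 0.0132 16051000 C T",
    "22 rs3 0 16052000 G A",
    "22 rs4 0.0250 16053000 T C",
]
keep_ids = snp_ids_with_cm(demo_bim)

# One SNP ID per line, which is the layout `plink --extract` expects.
extract_text = "\n".join(keep_ids) + "\n"
print(extract_text, end="")
```

Saving that text to a file and passing it to `plink --extract` together with `--make-bed` would give a fileset restricted to the mapped SNPs, which can then go into LD score estimation.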

The sample data in 22.fam are from 1000 Genomes, hence the missing phenotype. I don’t think anyone’s looked closely at estimating LD scores from case/control data, but the key consideration is that you want LD scores to be a sample that reflects the population LD structure. Depending on the prevalence of your phenotype and any ascertainment in your sampling design it’s possible you may want to estimate LD scores from controls only to avoid ascertainment-induced LD among risk variants. 
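
If you do go the controls-only route, the control IDs can be pulled from the .fam file. This sketch assumes PLINK's default phenotype coding in the sixth column (1 = control, 2 = case, 0 or -9 = missing); the function name `control_ids` and the demo rows are made up for illustration.

```python
def control_ids(fam_lines):
    """Return (FID, IID) pairs for individuals coded as controls (phenotype 1)."""
    controls = []
    for line in fam_lines:
        fields = line.split()
        # .fam columns: FID, IID, father ID, mother ID, sex, phenotype
        if len(fields) >= 6 and fields[5] == "1":
            controls.append((fields[0], fields[1]))
    return controls


# Made-up rows (not real data): two controls, one case, one missing phenotype.
demo_fam = [
    "F1 I1 0 0 1 1",
    "F2 I2 0 0 2 2",
    "F3 I3 0 0 1 1",
    "F4 I4 0 0 2 -9",
]
controls = control_ids(demo_fam)

# Two columns (FID IID) per line, the layout `plink --keep` expects.
keep_text = "".join(f"{fid} {iid}\n" for fid, iid in controls)
print(keep_text, end="")
```

Writing `keep_text` to a file and passing it to `plink --keep` with `--make-bed` would produce a controls-only fileset for LD score estimation.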

Alternatively, given that you’re already imputing your data it may be appropriate (and easier) to simply compute LD scores from that reference population rather than in your study sample.

Cheers,
Raymond


