heritability is negative

2,031 views
Skip to first unread message

Xianyong Yin

unread,
Mar 22, 2016, 3:20:49 PM3/22/16
to ldsc_users
Hi,

I used LDSC to estimate the heritability from a GWAS summary statistics which comes from a linear mixed model. The sample size is > 6000. I got a negative heritability estimate which is strange to me.
Here is my log file:

Call:
./ldsc.py \
--h2 dbgap.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--out dbgap \
--w-ld-chr eur_w_ld_chr/

Beginning analysis at Tue Mar 22 14:49:20 2016
Reading summary statistics from dbgap.sumstats.gz ...
Read summary statistics for 294672 SNPs.
Reading reference panel LD Score from eur_w_ld_chr/[1-22] ...
Read reference panel LD Scores for 1293150 SNPs.
Removing partitioned LD Scores with zero variance.
Reading regression weight LD Score from eur_w_ld_chr/[1-22] ...
Read regression weight LD Scores for 1293150 SNPs.
After merging with reference panel LD, 294552 SNPs remain.
After merging with regression SNP LD, 294552 SNPs remain.
Using two-step estimator with cutoff at 30.
Total Observed scale h2: -0.9444 (0.0851)
Lambda GC: 1.0165
Mean Chi^2: 1.0184
Intercept: 1.1341 (0.0096)
Ratio: 7.3008 (0.5221)

Any suggestions?

Thank you.

Xianyong

Raymond Walters

unread,
Mar 22, 2016, 5:35:39 PM3/22/16
to Xianyong Yin, ldsc_users
Hi Xianyong,
The ldsc h2 estimate is unbounded so negative heritability estimates are possible, especially when true h2 is small, but that wouldn’t be expected to produce estimates as strongly negative as you are observing. 

Some information that may be helpful to start diagnosing this:
1) Are your sumstats for a full genome-wide set of SNPs, or some targeted selection (e.g. exome variants, variants associated with some other phenotype, etc)?
2) Is your GWAS from a sample of european ancestry?
3) Do you have any strong genome-wide significant loci in your GWAS? Of particular note would be any strong effects for SNPs in regions with little-to-no LD.
4) For your LMM, when testing a given marker was that marker excluded from the GRM for the mixed model component (e.g. via leave-one-chromosome-out or similar)?

Cheers,
Raymond



------
Raymond K. Walters
Postdoctoral Research Fellow
Analytic & Translational Genetics Unit
Massachusetts General Hospital
rwal...@broadinsitute.org




--
You received this message because you are subscribed to the Google Groups "ldsc_users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ldsc_users+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ldsc_users/d9384590-b54d-4015-bd9b-1d6664ed51fb%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Xianyong Yin

unread,
Mar 22, 2016, 5:42:31 PM3/22/16
to ldsc_users, xiany...@gmail.com
Hi Raymond,

Thank you for your kind help. I understand the negative is possible. 

Here is further more information about my data.

Thanks,
Xianyong

On Tuesday, March 22, 2016 at 5:35:39 PM UTC-4, Raymond Walters wrote:
Hi Xianyong,
The ldsc h2 estimate is unbounded so negative heritability estimates are possible, especially when true h2 is small, but that wouldn’t be expected to produce estimates as strongly negative as you are observing. 

Some information that may be helpful to start diagnosing this:
1) Are your sumstats for a full genome-wide set of SNPs, or some targeted selection (e.g. exome variants, variants associated with some other phenotype, etc)?

Xianyong: It is the summary statistics for the whole genome set SNPs after imputation.  The imputation was implemented based on regular genome-wide array in IMPUTE2. I used all biallelic variants with at least imputation quality score 0.3.
 
2) Is your GWAS from a sample of european ancestry?

Xianyong: Yes, I think so. 
 
3) Do you have any strong genome-wide significant loci in your GWAS? Of particular note would be any strong effects for SNPs in regions with little-to-no LD.

Xianyong: No, I don't find any of them. The most significant signal is above 10^-8 

4) For your LMM, when testing a given marker was that marker excluded from the GRM for the mixed model component (e.g. via leave-one-chromosome-out or similar)?

Xianyong: No, I ran it with all variants in.  

Raymond Walters

unread,
Mar 23, 2016, 11:10:29 AM3/23/16
to Xianyong Yin, ldsc_users
Hi Xianyong,
Thanks for the information! Keeping my replies in-line…


Some information that may be helpful to start diagnosing this:
1) Are your sumstats for a full genome-wide set of SNPs, or some targeted selection (e.g. exome variants, variants associated with some other phenotype, etc)?

Xianyong: It is the summary statistics for the whole genome set SNPs after imputation.  The imputation was implemented based on regular genome-wide array in IMPUTE2. I used all biallelic variants with at least imputation quality score 0.3.

Raymond: Your initial log file lists 294k SNPs, which seems surprisingly small for a full imputed dataset. Did you process your sumstats file with munge_sumstats.py prior to running ldsc? If so, if would be very helpful to check that log file and make sure you aren’t losing more markers than expected at that step.  

 
2) Is your GWAS from a sample of european ancestry?

Xianyong: Yes, I think so. 

Raymond: I’d definitely recommend verifying this. If the LD structure from the European reference files isn’t a good match to the LD structure for your sample ldsc is liable to give strange (and uninterpretable) results.

 
3) Do you have any strong genome-wide significant loci in your GWAS? Of particular note would be any strong effects for SNPs in regions with little-to-no LD.

Xianyong: No, I don't find any of them. The most significant signal is above 10^-8 

Raymond: Good to know. In theory ldsc should be able to handle that scenario regardless, but this helps rule out that explanation.


4) For your LMM, when testing a given marker was that marker excluded from the GRM for the mixed model component (e.g. via leave-one-chromosome-out or similar)?

Xianyong: No, I ran it with all variants in.  

Raymond: This is potentially problematic. If the marker being testing is also in the GRM, then the GRM will absorb some of the signal from the SNP (if any) and decrease your power (see the discussion of MLMi vs. MLMe in Yang et al., 2014, Nature Genetics). Further, if you didn’t prune for LD prior to estimating the GRM, then I would expect that this effect may be stronger for SNPs in high LD regions since they are potentially more strongly tagged by the GRM. A bias against high LD SNPs would run directly opposite of the expected pattern of GWAS results modeled by ldsc, and could contribute to the stronger negative heritability estimate.

Cheers,
Raymond






Xianyong Yin

unread,
Mar 26, 2016, 2:28:49 PM3/26/16
to ldsc_users, xiany...@gmail.com
Hi Raymond,

Thank you for your great suggestions.  I ran the leave-one-chromosome-out LMM test in GCTA and went throug ldsc regression again. Here is my response and the log file. 

Looking forward to your further suggestions. Thank you.

Xianyong 

log File I:

*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.0
* (C) 2014-2015 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call:
./munge_sumstats.py \
--out dbgap.ea.nd \
--merge-alleles w_hm3.snplist \
--N 6970.0 \
--sumstats dbgap.ea.nd.mlma.assoc

Interpreting column names as follows:
A1:     Allele 1, interpreted as ref allele for signed sumstat.
p:      p-Value
A2:     Allele 2, interpreted as non-ref allele for signed sumstat.
b:      [linear/logistic] regression coefficient (0 --> no effect; above 0 --> A1 is trait/risk increasing)
rs:     Variant ID (e.g., rs number)

Reading list of SNPs for allele merge from w_hm3.snplist
Read 1217311 SNPs for allele merge.
Reading sumstats from dbgap.ea.nd.mlma.assoc into memory 5000000.0 SNPs at a time.
Read 3916453 SNPs from --sumstats file.
Removed 3282355 SNPs not in --merge-alleles.
Removed 0 SNPs with missing values.
Removed 0 SNPs with INFO <= 0.9.
Removed 0 SNPs with MAF <= 0.01.
Removed 0 SNPs with out-of-bounds p-values.
Removed 52 variants that were not SNPs or were strand-ambiguous.
634046 SNPs remain.
Removed 0 SNPs with duplicated rs numbers (634046 SNPs remain).
Using N = 6970.0
Median value of b was 0.000417715, which seems sensible.
Removed 23 SNPs whose alleles did not match --merge-alleles (634023 SNPs remain).
Writing summary statistics for 1217311 SNPs (634023 with nonmissing beta) to /scratch/xyyin/impute/depLDSC/dbgap.ea.nd.sumstats.gz.

Metadata:
Mean chi^2 = 1.078
Lambda GC = 1.062
Max chi^2 = 91.437
43 Genome-wide significant SNPs (some may have been removed by filtering).

Conversion finished at Sat Mar 26 14:21:46 2016
Total time elapsed: 19.46s

h^2 log File:

*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.0
* (C) 2014-2015 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call:
./ldsc.py \
--h2 dbgap.ea.nd.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--out dbgap.ea.nd.h2 \
--w-ld-chr eur_w_ld_chr/

Beginning analysis at Sat Mar 26 14:22:05 2016
Reading summary statistics from dbgap.ea.nd.sumstats.gz ...
Read summary statistics for 634023 SNPs.
Reading reference panel LD Score from eur_w_ld_chr/[1-22] ...
Read reference panel LD Scores for 1293150 SNPs.
Removing partitioned LD Scores with zero variance.
Reading regression weight LD Score from eur_w_ld_chr/[1-22] ...
Read regression weight LD Scores for 1293150 SNPs.
After merging with reference panel LD, 633086 SNPs remain.
After merging with regression SNP LD, 633086 SNPs remain.
Using two-step estimator with cutoff at 30.
Total Observed scale h2: -1.3723 (0.1027)
Lambda GC: 1.0618
Mean Chi^2: 1.0784
Intercept: 1.2702 (0.0113)
Ratio: 3.4475 (0.1438)
Analysis finished at Sat Mar 26 14:22:16 2016
Total time elapsed: 10.64s






On Wednesday, March 23, 2016 at 11:10:29 AM UTC-4, Raymond Walters wrote:
Hi Xianyong,
Thanks for the information! Keeping my replies in-line…


Some information that may be helpful to start diagnosing this:
1) Are your sumstats for a full genome-wide set of SNPs, or some targeted selection (e.g. exome variants, variants associated with some other phenotype, etc)?

Xianyong: It is the summary statistics for the whole genome set SNPs after imputation.  The imputation was implemented based on regular genome-wide array in IMPUTE2. I used all biallelic variants with at least imputation quality score 0.3.

Raymond: Your initial log file lists 294k SNPs, which seems surprisingly small for a full imputed dataset. Did you process your sumstats file with munge_sumstats.py prior to running ldsc? If so, if would be very helpful to check that log file and make sure you aren’t losing more markers than expected at that step.  

Xianyong-2: I checked my post-imputed data. There are > 3 million variants with MAF > 1%.  I ran munge_sumstats.py with --merge-alleles, and only 634023 sharing SNPs were included in the generated dbgap.ea.nd.sumstats.gz file.  I am not sure why the majority of my SNPs are not overlapped with the HapMap3 SNPs. Is it possible caused by my imputation were implemented using 1K reference panel?

Raymond Walters

unread,
Mar 28, 2016, 10:47:09 AM3/28/16
to Xianyong Yin, ldsc_users
Hi Xianyong,
The vast majority of HapMap3 SNPs are expected to be well imputed by 1000 Genomes (that’s part of the reason they are recommended), so I’m not sure why you’re getting so few matches. The munge_sumstats.py script matches on SNP names, so it’s possible the problem is different SNP naming conventions between your data and the w_hm3.snplist file? The w_hm3.snplist file is just a text file so that shouldn’t be too difficult to check.

From your munge_sumstats log, it does look like you have some strong effect loci (43 genome-wide significant SNPs, with max chisq of 91). LDSC tries to compensate for this, but it might be worth manually removing those loci (+/- roughly 1MB) and retrying. 

If those don’t resolve your negative h2 estimate I’m not sure what else to recommend. Using LDSC with LMM-based summary statistics is not something we’ve investigated fully, so it’s quite possible your results are best taken as evidence against the validity of LDSC for LMM results. 

Cheers,
Raymond




Reply all
Reply to author
Forward
0 new messages