LD score regression for Hispanics

JMR JMR

Feb 3, 2018, 4:53:23 AM
to ldsc_users
Hi Raymond,

I am trying to estimate the LD score regression intercept for a GWAS I've conducted on Hispanic data. I want to test for polygenicity only.

I've followed the tutorial and decided to download all AMR samples from 1KG and compute LD scores. I used:

python ldsc.py \
	--bfile AMR1kg \
	--l2 \
	--ld-wind-cm 1 \
	--out AMR1kg

Should I have used something else? I used PLINK to read the 1KG VCF files and didn't run into any issues after keeping only bi-allelic SNPs.

Then I used munge_sumstats.py to get my GWAS summary statistics into the correct format, keeping only HapMap3 SNPs. That worked fine as well.

Now, if I only want to test for polygenicity in my trait, I think I only need the --h2 flag. But I am confused about the --ref-ld-chr and --w-ld-chr flags.

Should the input files for these two flags simply be the *.l2.ldscore files?

Thank you.

JMR JMR

Feb 3, 2018, 11:59:37 AM
to ldsc_users
Small update:

I went ahead and ran the --h2 analysis with the precomputed LD scores (my sample has high European admixture, so I thought I might get something sensible). The GWAS summary statistics were computed after using PCs to correct for population structure, and there are no relatives in the sample.

Here is the call and its output:

Call: 
./ldsc.py \
--h2 ../summstat/mel.sumstats.gz \
--ref-ld-chr ../eur_w_ld_chr/ \
--out ../mel_eur \
--w-ld-chr ../eur_w_ld_chr/ 

Beginning analysis at Sat Feb  3 10:30:59 2018
Reading summary statistics from ../summstat/mel.sumstats.gz ...
Read summary statistics for 522447 SNPs.
Reading reference panel LD Score from ../eur_w_ld_chr/[1-22] ...
Read reference panel LD Scores for 1290028 SNPs.
Removing partitioned LD Scores with zero variance.
Reading regression weight LD Score from ../eur_w_ld_chr/[1-22] ...
Read regression weight LD Scores for 1290028 SNPs.
After merging with reference panel LD, 514692 SNPs remain.
After merging with regression SNP LD, 514692 SNPs remain.
Using two-step estimator with cutoff at 30.
Total Observed scale h2: 1.3441 (0.7782)
Lambda GC: 1.1651
Mean Chi^2: 1.3098
Intercept: 1.1678 (0.0267)
Ratio: 0.5417 (0.0862)
Analysis finished at Sat Feb  3 10:31:11 2018
Total time elapsed: 11.63s

I am not sure why I am getting such a high heritability estimate. I thought it might have to do with my sample size, but it's around 6,000 individuals. My mel.sumstats.gz file was built from the PLINK .assoc file, and I used the NMISS column (the number of non-missing genotypes per SNP) to fill the N column of the sumstats.gz file.

Any help will be greatly appreciated.

Raymond Walters

Feb 6, 2018, 7:38:10 PM
to JMR JMR, ldsc_users
Hi,

Lots of topics to cover here:

1) Using LD score regression with GWAS of recently admixed populations generally isn't recommended. The challenge is that LD in these samples becomes a mix of within-ancestry LD patterns and the pattern of haplotypes from the different ancestries across the genome for each individual (i.e. the local ancestry chromosome painting). For that reason, relying on a reference panel for LD information in LDSC becomes tenuous. For instance, there is anecdotal evidence that this makes LDSC unstable in African-American populations.

Since you mention only wanting to test for polygenicity, I suspect you already know this and are looking to run LDSC with a very limited interpretation, e.g. trying to reject a null of 100% population stratification/confounding in the GWAS without inferring an h2 estimate. I just want to make clear to the rest of this email list that we're talking about this nuanced and somewhat speculative use case.

2) Computing LD scores from 1KG AMR individuals is probably the right place to start as a reference panel approximating Hispanics. Given your sample has high European admixture, you might also consider a combined AMR+EUR panel.

3) Based on your code snippet, I'm guessing you have a single *.l2.ldscore file rather than one split by chromosome? In that case you'll want to specify that file with the --ref-ld and --w-ld arguments (without -chr). Specifying the same file for both should be reasonable in this case, but see the supplement of the first partitioned h2 paper (Finucane et al. 2015) if you want more discussion of the relationship between regression SNPs, reference SNPs, and regression weights.
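For the single-file case, the call could be assembled as in the sketch below. The file prefixes and output name here are hypothetical (ldsc expects the --ref-ld/--w-ld value to be the LD score file name minus the .l2.ldscore(.gz) extension):

```python
# Sketch of an ldsc call for the single-file (non-chromosome-split) case,
# using --ref-ld/--w-ld instead of the -chr variants. "AMR1kg" is the
# hypothetical prefix of AMR1kg.l2.ldscore(.gz); the same scores serve as
# both the reference LD and the regression weights.
args = [
    "python", "ldsc.py",
    "--h2", "mel.sumstats.gz",
    "--ref-ld", "AMR1kg",
    "--w-ld", "AMR1kg",
    "--out", "mel_amr",
]
command = " ".join(args)
print(command)
```

The list form is convenient if you drive ldsc from Python via subprocess; otherwise the printed string can be run directly in a shell.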

4) PLINK's NMISS field should indeed work for sample size. You can check the contents of the *.sumstats.gz file to verify the sample size was included correctly. (The log from munge_sumstats.py should also indicate the values read.)
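One way to do that check, sketched below. This assumes the usual munge_sumstats.py output layout (a gzipped, tab-delimited table with an N column); the file name and demo values are made up:

```python
# Sketch of a sanity check on the N column of a munged sumstats file,
# to confirm the per-SNP sample sizes (from PLINK's NMISS) came through.
import csv
import gzip

def n_summary(path):
    """Return (min, max, mean) of the N column of a *.sumstats.gz file."""
    ns = []
    with gzip.open(path, "rt") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            ns.append(float(row["N"]))
    return min(ns), max(ns), sum(ns) / len(ns)

# Demo on a tiny mock file (a real file would come from munge_sumstats.py):
with gzip.open("mock.sumstats.gz", "wt") as fh:
    fh.write("SNP\tA1\tA2\tZ\tN\n")
    fh.write("rs1\tA\tG\t1.2\t5990\n")
    fh.write("rs2\tC\tT\t-0.4\t6000\n")
print(n_summary("mock.sumstats.gz"))  # (5990.0, 6000.0, 5995.0)
```

If the minimum N is far below the nominal sample size, variants with high missingness may be dragging the effective N down.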

5) As far as the high h2 estimate, many things to consider:
a. If you’re sticking to the “polygenicity only” interpretation, then the point estimate of h2 theoretically shouldn’t matter.
b. It's safe to say that running ldsc with Eur. LD scores on an admixed Hispanic GWAS sample is going to be a badly misspecified model, so any estimate should be interpreted with a very healthy dose of skepticism.
c. The SE is huge, so that h2 is not significantly large (the moderate sample size and model misspecification will both contribute here).
d. Your mean chi2 is quite strong for N=6000, suggesting high h2 and/or lots of stratification/confounding.
e. If that high mean chi2 is partially reflecting some loci with very strong GWAS hits, then you may want to consider looking at sensitivity to the two-step estimator settings (see this thread for details).
f. With 500k variants, you only have results for about half the HapMap3 sites. If those variants are a non-random selection of genome-wide SNPs likely to be enriched for signal (for example, if they're from a targeted genotyping array like exome chip or Metabochip), then extrapolating from them to the full genome as ldsc does will yield biased results.
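On point e, one way to probe sensitivity is to rerun the same call with several chi^2 cutoffs and compare intercepts. This sketch assumes your ldsc version exposes the --two-step flag controlling the cutoff reported in the log ("cutoff at 30" is the default); the paths are the ones from the log above, and the output names are made up:

```python
# Sketch: generate ldsc reruns with different two-step chi^2 cutoffs.
# Assumes the --two-step flag is available in your ldsc version.
cutoffs = [20, 30, 80]
commands = [
    "./ldsc.py --h2 ../summstat/mel.sumstats.gz "
    "--ref-ld-chr ../eur_w_ld_chr/ --w-ld-chr ../eur_w_ld_chr/ "
    f"--two-step {c} --out ../mel_eur_twostep{c}"
    for c in cutoffs
]
for cmd in commands:
    print(cmd)
```

If the intercept moves substantially across cutoffs, that's a sign the strong loci are leveraging the regression.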

In short, there's lots of instability and many potential issues in this analysis, so it will probably take a lot of probing and evaluating different approaches before you can start making a case for which results are likely to be "real".

Cheers,
Raymond




JMR JMR

Feb 7, 2018, 12:07:35 PM
to ldsc_users
Hi Raymond,

Thank you so much for the detailed answer. 

For 1), yes. I basically just want to know whether, after correcting for population structure in my GWAS, the high inflation factor I found can be explained by polygenicity.

For 2), thanks. All the AMR populations have extensive EUR admixture, so I assume using them alone would be enough here?

For 3), I now have LD scores for AMR. Following the manual, I only used HapMap3 SNPs to compute the LD scores. Regarding the run, it is still not clear to me what I should use for the --ref-ld-chr and --w-ld-chr flags. Should I use the *.l2 files from AMR for both flags, or should I use --ref-ld-chr for the AMR LD scores and --w-ld-chr for LD scores based on my chip data (i.e. my target population)?

For 4), thanks. I checked, and it was converted to the appropriate format correctly.

For 5), as you mentioned, I am not interested in the heritability estimate itself, but I was surprised by the high (greater than 1) estimate. Also, I have a couple of SNPs with very strong effects, and I have a run from a conditional analysis on these high-effect variants. Should I have used the conditional analysis instead?

Thanks,

Javier

Raymond Walters

Feb 7, 2018, 3:39:15 PM
to JMR JMR, ldsc_users
Hi Javier,

For 2) I fully agree AMR is probably a good fit, just wanted to mention the alternative in case you wanted to explore that further.

For 3) You should use the AMR scores for both the --ref-ld and --w-ld flags. If you have within-sample LD scores you'd like to use, those can go into a separate analysis, specifying those files for both flags instead.

More generally, the LD scores for the --ref-ld and --w-ld flags are normally expected to come from the same sample. The separate flags exist to allow different files in the case of either partitioned ldsc (where --ref-ld includes all the partitioned scores and --w-ld contains a single set of genome-wide scores) or if you want to use a different set of SNPs for defining the LD-based regression weights (--w-ld) vs. the scores used within the regression (--ref-ld). The latter option is what's elaborated on in the supplement of Finucane et al. 2015.

For 5) Yes, running ldsc on the conditional analysis would be useful to check whether those strong-effect loci are disrupting the LD score regression. I'd view that as complementary to your primary ldsc analysis rather than a replacement for it.

Under normal circumstances I'd agree that h2 > 1.3 is high enough to be surprising and to need further investigation. The ldsc estimator of h2 is unbounded, so values > 1 are possible, but > 1.3 is definitely unusual. However, since you're in an unusual use case, with lots of reasons to anticipate unstable results and possible miscalibration of the h2 estimate, I'm not as concerned about the point estimate (vs. all the other aspects to evaluate).

Cheers,
Raymond



JMR JMR

Feb 8, 2018, 5:19:36 AM
to ldsc_users
Hi Raymond,

Following your suggestions I estimated the intercept for my data with AMR LD-scores.

The summary statistics were computed using only HapMap3 SNPs (~500K SNPs). The LD scores for AMR (using 100kb windows) were also estimated using only HapMap3 SNPs. Is this correct?

Then I went ahead and used the --h2 flag, setting both --ref-ld-chr and --w-ld-chr to the LD scores previously computed for AMR. The overlap was ~500K SNPs.

The h2 is again high. The intercept, however, is much smaller (1.08) than the previous one (1.16) obtained with the precomputed EUR LD scores. This would mean that approximately 28% of the inflation is due to population structure, right?
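For reference, the ~28% figure can be reproduced from the log values, since ldsc's ratio is (intercept - 1) / (mean chi^2 - 1):

```python
# Reproducing the ratio from the reported log values:
# ratio = (intercept - 1) / (mean chi^2 - 1), i.e. the share of the
# inflation in mean chi^2 attributed to the intercept rather than to h2.
intercept = 1.0895
mean_chi2 = 1.3135
ratio = (intercept - 1.0) / (mean_chi2 - 1.0)
print(ratio)  # ~0.285, in line with the 0.2856 (0.0446) in the log
```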

I am going to try to see what I get using the conditional analysis.

Thanks again for the help. 

Javier




Call: 
./ldsc.py \
--h2 summstat/mel_hapmap3.sumstats.gz \
--ref-ld-chr data/onlyhapmap3/ \
--out melAMRhapmap3 \
--w-ld-chr data/onlyhapmap3/ 

Beginning analysis at Thu Feb  8 10:12:35 2018
Reading summary statistics from summstat/mel_hapmap3.sumstats.gz ...
Read summary statistics for 522447 SNPs.
Reading reference panel LD Score from data/onlyhapmap3/[1-22] ...
Read reference panel LD Scores for 1205057 SNPs.
Removing partitioned LD Scores with zero variance.
Reading regression weight LD Score from data/onlyhapmap3/[1-22] ...
Read regression weight LD Scores for 1205057 SNPs.
After merging with reference panel LD, 518151 SNPs remain.
After merging with regression SNP LD, 518151 SNPs remain.
Using two-step estimator with cutoff at 30.
Total Observed scale h2: 1.4707 (0.6154)
Lambda GC: 1.1683
Mean Chi^2: 1.3135
Intercept: 1.0895 (0.014)
Ratio: 0.2856 (0.0446)
Analysis finished at Thu Feb  8 10:12:43 2018
Total time elapsed: 8.51s

JMR JMR

Feb 8, 2018, 5:33:40 AM
to ldsc_users
I went on and computed the intercept using my conditional analysis on the top (high-effect and significant) SNPs.

The intercept is still elevated above 1 (1.07), and the ratio is now larger (0.68).

Not sure how to interpret this.

Thank you,

Javier

Call: 
./ldsc.py \
--h2 summstat/melcond1_hapmap3.sumstats.gz \
--ref-ld-chr data/onlyhapmap3/ \
--out melcond1AMRhapmap3 \
--w-ld-chr data/onlyhapmap3/ 

Beginning analysis at Thu Feb  8 10:31:15 2018
Reading summary statistics from summstat/melcond1_hapmap3.sumstats.gz ...
Read summary statistics for 522436 SNPs.
Reading reference panel LD Score from data/onlyhapmap3/[1-22] ...
Read reference panel LD Scores for 1205057 SNPs.
Removing partitioned LD Scores with zero variance.
Reading regression weight LD Score from data/onlyhapmap3/[1-22] ...
Read regression weight LD Scores for 1205057 SNPs.
After merging with reference panel LD, 518140 SNPs remain.
After merging with regression SNP LD, 518140 SNPs remain.
Using two-step estimator with cutoff at 30.
Total Observed scale h2: 0.2605 (0.0767)
Lambda GC: 1.1082
Mean Chi^2: 1.117
Intercept: 1.0794 (0.0078)
Ratio: 0.6785 (0.0664)
Analysis finished at Thu Feb  8 10:31:23 2018
Total time elapsed: 8.27s

Raymond Walters

Feb 19, 2018, 12:23:13 PM
to JMR JMR, ldsc_users
Hi Javier,

There are a few things that seem noteworthy in the comparison between these runs.

1) The AMR scores do appear to provide a better fit to your GWAS results. The decrease in inflation of the intercept (from 1.17 to 1.09) and the narrower SEs are the key indicators here.

2) Dropping the outlier significant loci drastically improves the stability of the results (again, indicated by the substantial narrowing of SEs). The much larger impact of the outlier exclusion on the h2 vs. the intercept almost certainly reflects the impact of the two-step estimation method for the intercept. I’d expect that the intercept in that first AMR analysis is highly sensitive to the two-step settings.

3) The outlier exclusion only removes 11 SNPs, yet reduces the mean chi2 from 1.3 to 1.1. This suggests that those SNPs have extraordinarily strong effects, which does a lot to explain the unstable and out-of-bounds h2 estimate. Excluding those variants is probably the preferable route; you could then consider estimating their variance explained as fixed effects, separate from the polygenic h2 that ldsc estimates from the rest of the genome-wide variants.

4) The remaining intercept (1.08) may reflect population stratification in the GWAS and/or remaining model misspecification from the admixed AMR LD scores and GWAS population. I wouldn't worry too much about the change in ratio between the AMR analyses with and without outliers (that largely reflects the big change in mean chi2). The high ratio in the non-outlier analysis is still worth thinking about, though.
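To spell out point 4: the ratio is (intercept - 1) / (mean chi^2 - 1), so with the intercept nearly unchanged, cutting the mean chi^2 shrinks the denominator and the ratio jumps. Using the values from the two AMR logs:

```python
# Ratio = (intercept - 1) / (mean chi^2 - 1) for the two AMR runs.
# The intercept barely moves, but the conditional analysis cuts the
# mean chi^2, so the denominator shrinks and the ratio jumps.
runs = {
    "AMR, all SNPs":    (1.0895, 1.3135),
    "AMR, conditional": (1.0794, 1.1170),
}
for name, (intercept, mean_chi2) in runs.items():
    ratio = (intercept - 1) / (mean_chi2 - 1)
    print(f"{name}: ratio = {ratio:.3f}")
```

The two printed values track the 0.2856 and 0.6785 in the logs, which is why the ratio change itself isn't alarming here.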

To the overall goal of evaluating whether there’s polygenic signal, the non-outlier AMR analysis is a big improvement over the previous ones. More sensitivity analyses and investigation of the remaining intercept are probably warranted, but I’d be cautiously optimistic about the current results.

Cheers,
Raymond

