Basic questions regarding LD Score computation and genotyping


Johanne Håøy Horn

Jun 25, 2016, 2:56:48 PM
to ldsc_users
Hi,

I am quite new to the ldsc software (and bioinformatics in general, for that matter), and struggle somewhat with the concept of an LD Score for a SNP. I've read your two articles on single-trait and cross-trait regression. My limited experience with LD has been concerned with r^2 between different SNPs. When you quantify an LD score for one particular SNP, what does the L2 reflect? How is the LD score computed for a particular SNP, given the variants that are in LD with it?

Also, do you use 1000G or HapMap genotype information when computing LD scores? I see both being referenced in the tutorials and article, but I am not familiar enough with genotype data sets to recognize which set is used and for what purpose. It seems as though you recommend using HapMap phase 3 LD data in other google group posts, but I didn't fully understand why. Could you explain the reason behind this recommendation?

On a different note, I also wondered about this sentence from the URL section of "An atlas of genetic correlations across human diseases and traits": "IIBDGC (inflammatory bowel disease) summary statistics (we used a newer version of these data with 1000 Genomes Project imputation)". I see the IIBDGC summary statistics you link to are marked as deprecated. How do these data sets differ from the newer version you used? 

Forgive me if this information is available already, either through an article of yours, in this google group or at the wiki.

Best regards,
Johanne

Raymond Walters

Jun 27, 2016, 10:52:08 AM
to Johanne Håøy Horn, ldsc_users
Hi Johanne,
These are all good questions.

For a given SNP, its LD score is the sum of its r^2 with all SNPs (or, to use the notation from the papers, l_j = sum_k r^2_{jk}). So a SNP j that is in LD with lots of other SNPs will have many large r^2 values with those SNPs and thus a larger LD score. The methods section of the first paper adds some nuance about which SNPs to sum over and about using an adjusted r^2 estimator, but the core idea stands: the LD score of a SNP is the sum of its r^2 with all SNPs.
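In other words, given a genotype matrix, an unadjusted LD score is just a row sum of the squared correlation matrix. Here is a toy Python sketch of that idea (the function name and toy data are mine, not from ldsc, and it ignores the 1 cM window and the adjusted r^2 estimator described in the paper's methods):

```python
import numpy as np

def ld_scores(genotypes):
    """Unadjusted LD scores: l_j = sum_k r^2_{jk}.

    genotypes: (n_individuals, n_snps) array of 0/1/2 allele counts.
    Returns one LD score per SNP.
    """
    # Pearson correlation between every pair of SNP columns
    r = np.corrcoef(genotypes, rowvar=False)
    r2 = r ** 2
    # Sum each SNP's r^2 with all SNPs (including itself, so l_j >= 1)
    return r2.sum(axis=1)

# Toy example: 6 individuals, 3 SNPs; SNPs 1 and 2 are in perfect LD
g = np.array([[0, 0, 2],
              [1, 1, 1],
              [2, 2, 0],
              [0, 0, 1],
              [1, 1, 2],
              [2, 2, 0]])
scores = ld_scores(g)
```

Here SNPs 1 and 2 are identical (r^2 = 1 with each other), and each has r^2 = 0.5625 with SNP 3, so their LD scores come out larger than SNP 3's.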

The HapMap/1000Genomes discussion is something that has continued to evolve over the course of work on LD score regression. There’s also additional potential confusion since HapMap/1000Genomes is relevant to at least three separate decisions: (1) what reference dataset to use for computing LD scores for each SNP, (2) what SNPs to include in the sum when summing r^2 to create LD scores, and (3) what SNPs to use in the regression for LD score regression. The recommendation of HapMap3 SNPs is for item 3, i.e. to define which SNPs to include in the regression (and corresponds to the use of ‘--merge-alleles w_hm3.snplist’ in munge_sumstats.py as done in the tutorial on the wiki). 

The reason for this recommendation is that it provides an easy, consistent way to make sure stable, high-quality SNPs are used for the regression without requiring other metrics (e.g. imputation info score, MAF) that may not be available for all datasets. This is particularly valuable to have as a baseline when trying to do genetic correlation analyses with publicly available GWAS results that may not include these additional metrics for filtering. Since we can expect the HapMap3 SNPs to be reliable, well-imputed, and present in almost all GWASes, and since empirically there is little cost to restricting to these SNPs vs. some denser list of genome-wide markers, they've become the standard recommendation. (Plus, for the google group, having a standard baseline is also useful for troubleshooting, and merging against this list can be a useful check to expose any potential irregularities with the input data.)

For the IBD data, I wasn’t involved in those analyses but my guess would be that the atlas paper used an interim update somewhere between the previous (now deprecated) release and the current release. Maybe Brendan/Hilary/Ben/others can provide more info here?

Cheers,
Raymond





Johanne Håøy Horn

Jun 27, 2016, 2:33:37 PM
to ldsc_users, johan...@gmail.com
Thank you so much for your swift reply! It made things much clearer.

Concerning the IBD data, instead of looking further at the deprecated data, I downloaded the newest sets (http://www.ibdgenetics.org/downloads.html).
However, when I try to run munge_sumstats on the ulcerative colitis and Crohn's disease data sets, I get a warning "sys:1: DtypeWarning: Columns (8) have mixed types. Specify dtype option on import or set low_memory=False.", followed by the following error in converting summary statistics:

Traceback (most recent call last):
  File "/Users/Johanne/Software/ldsc/munge_sumstats.py", line 654, in munge_sumstats
    check_median(dat.SIGNED_SUMSTAT, signed_sumstat_null, 0.1, sign_cname))
  File "/Users/Johanne/Software/ldsc/munge_sumstats.py", line 363, in check_median
    m = np.median(x)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/lib/function_base.py", line 2717, in median
    return mean(part[indexer], axis=axis, out=out)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/fromnumeric.py", line 2716, in mean
    out=out, keepdims=keepdims)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/_methods.py", line 67, in _mean
    ret = ret.dtype.type(ret / rcount)
AttributeError: 'float' object has no attribute 'type'

I have the latest version of ldsc. Do you have any suggestions as to what might be wrong?

Best,
Johanne

Raymond Walters

Jun 27, 2016, 3:23:09 PM
to Johanne Håøy Horn, ldsc_users
Hi Johanne,
From the error I suspect that there are non-numeric values in the summary statistic column, possibly "NA"s? You would need to remove those prior to running munge_sumstats.py. Otherwise, if that's not the case, it's probably some other unusual feature disrupting the input file parsing.
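As a sketch of that cleanup step in pandas (the toy rows and the assumption that "OR" is the signed-statistic column are mine, not from the actual IIBDGC file): coerce the column to numeric and drop whatever fails to parse.

```python
import pandas as pd
from io import StringIO

# Toy sumstats with an "NA" in the OR column (assumed column name)
raw = """SNP A1 A2 OR P
rs1 A G 1.05 0.2
rs2 C T NA 0.9
rs3 G A 0.98 0.5
"""
df = pd.read_csv(StringIO(raw), sep=r"\s+")

# Coerce OR to numeric; unparseable values (e.g. "NA") become NaN,
# then drop those rows before handing the file to munge_sumstats.py
df["OR"] = pd.to_numeric(df["OR"], errors="coerce")
clean = df.dropna(subset=["OR"])
```

For a real file you would read from disk (with low_memory=False to silence the DtypeWarning) and write the cleaned frame back out before munging.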

Two other notes:
1) To whatever extent those IBD results are "trans-ancestry" (as described on the downloads page), they'll be misspecified under the LD score model, since they'll correspond to a different LD structure than any of the single-ancestry reference panels. If results are given for single-ancestry subsets of the data, those will be highly preferable for ldsc analyses.
2) In case it might be more convenient, there are IBD results for the European subset available for comparison through LD Hub (ldsc.broadinstitute.org)

Cheers,
Raymond



Johanne Håøy Horn

Jun 27, 2016, 3:42:35 PM
to ldsc_users, johan...@gmail.com
I see! Thank you so much for your help!

The zip of the latest IBD files contains both EUR and trans-ethnic data sets, so I can stick to single-ancestry data. If I am unable to locate the weird values causing the error, I will use the files you link to :)

Best,
Johanne

Laramie Duncan

Jun 27, 2016, 5:37:42 PM
to ldsc_users, johan...@gmail.com
Hey Raymond :) 

Hi Johanne,

Funny timing; I just got this message as well "sys:1: DtypeWarning: Columns (8) have mixed types". I suspected it might be the alleles, given that they can look like this:  CTTTTTTTTTTTT+1

So I made a new input file containing only SNPs (i.e. I removed all indels), and it worked.
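A minimal sketch of that indel filter (toy rows and column names are assumptions, not the real file): keep only rows where both alleles are a single A/C/G/T base.

```python
import pandas as pd
from io import StringIO

# Toy rows: rs2's A2 is an indel-style allele like the one above
raw = """SNP A1 A2 OR
rs1 A G 1.05
rs2 C CTTTTTTTTTTTT+1 0.80
rs3 G A 0.98
"""
df = pd.read_csv(StringIO(raw), sep=r"\s+")

# Keep only biallelic SNPs: both alleles must be a single A/C/G/T base
bases = {"A", "C", "G", "T"}
snps_only = df[df["A1"].isin(bases) & df["A2"].isin(bases)]
```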

Best,
Laramie

Johanne Håøy Horn

Jul 6, 2016, 5:20:36 AM
to ldsc_users, johan...@gmail.com
I have checked the signed statistics column (OR); the values convert to float without error when parsed normally. I also removed SNPs in the following format, in case their alleles messed up the parsing:
10 chr10_2622752_D 2622752 I2 D 0.97 0.968 0.929 1.06184 0.0691 0.3857 -+-+--- 37.9 0.04492

I still get the same error, however. The column headers of the file are the following: 
CHR SNP BP A1 A2 FRQ_A_5956 FRQ_U_14927 INFO OR SE P Direction HetISqt HetPVa

I have parsed through columns CHR, BP, FRQ_A..., FRQ_U..., INFO, OR and SE, and they all contain numerical values. 
The data set I use is "EUR.CD.gwas.assoc" from the zip in the top download link here:  http://www.ibdgenetics.org/downloads.html
Any suggestions to what else might be wrong when running munge_sumstats?

Best,
Johanne

Johanne Håøy Horn

Jul 6, 2016, 7:19:00 AM
to ldsc_users, johan...@gmail.com
A detail I forgot to disclose: the data set I use is the same as the one LD Hub links to for Crohn's disease and ulcerative colitis.

I also have a question regarding the resulting sumstats file. For one of the data sets that was munged successfully, I got this:
SNP A1 A2 Z N
rs3094315 A G -0.098 25500.000
rs3131972
rs3131969
rs1048488
rs3115850
rs2286139
rs12562034 G A -1.287 25500.000
rs4040617 A G 0.573 25500.000

Why don't all of the SNPs in the sumstats file have A1, A2, Z and N information?

Best,
Johanne

Raymond Walters

Jul 6, 2016, 12:35:31 PM
to Johanne Håøy Horn, ldsc_users
Hi Johanne,

For the Dtype error:
After some digging, it looks like the indel allele names are fine and that the actual issue is extreme numeric values. Specifically, EUR.CD.gwas.assoc contains odds ratios > 10^308, which are big enough to overflow Python's floating-point limits.

The exact value of the overflow threshold will depend on your system. You can look up your local limit by running the following code within python:

import sys
sys.float_info

# Example output (note max=1.7976931348623157e+308):
# sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)

After filtering the sumstats to OR < 1e300, I was able to run munge_sumstats.py successfully with no further changes.
[As an aside, working from public summary statistics made this much easier to debug. :)]
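A sketch of that filter in pandas (toy rows and the "OR" column name are assumptions; a real run would read the file from disk):

```python
import pandas as pd
from io import StringIO

# Toy data: rs2 has an absurd odds ratio, far beyond any plausible effect
raw = """SNP A1 A2 OR
rs1 A G 1.05
rs2 T C 1e305
rs3 G A 0.98
"""
df = pd.read_csv(StringIO(raw), sep=r"\s+")

# Drop rows whose OR exceeds a bound safely below the ~1.8e308 float64 max
df["OR"] = pd.to_numeric(df["OR"], errors="coerce")
filtered = df[df["OR"] < 1e300]
```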


Empty entries in sumstats.gz:
The SNPs without A1/A2/Z/N values are either not present in your dataset or don't pass the other filters in munge_sumstats for your data (e.g. MAF, INFO, alleles not consistent with the --merge-alleles file, etc.).

The entries for those SNPs are still included to allow all sumstats.gz files processed with the same merge-alleles file to have an identical structure. If you check the log from munge_sumstats, you should have a line like:

Writing summary statistics for [X] SNPs ([Y] with nonmissing beta) to [out].sumstats.gz.

where [X] is the number of SNPs in --merge-alleles and [Y] is the number that actually have A1/A2/Z/N entries.


Cheers,
Raymond



Johanne Håøy Horn

Jul 6, 2016, 1:23:49 PM
to ldsc_users, johan...@gmail.com
Oh! I see :)

I actually thought it had something to do with the format of the OR.
In a desperate attempt at fixing something, I ran through the file and wrote all lines to a new file with str(float(column_value)). I was puzzled to see that it worked, but have no idea why.

I can't find any very high OR values in the file I am using. Do you have an example rsid I can look up, to check whether or not I have managed to filter it out somehow without noticing?

And thank you for your (super) fast support!

Best,
Johanne

Raymond Walters

Jul 7, 2016, 1:43:36 PM
to Johanne Håøy Horn, ldsc_users
Hi Johanne,
Here’s the first one I found, about 2200000-2300000 lines into the file:

zcat EUR.CD.gwas.assoc.gz | awk 'NR==1 || $9 > 1e300' | head -n 2
CHR SNP BP A1 A2 FRQ_A_5956 FRQ_U_14927 INFO OR SE P Direction HetISqt HetPVa
1 rs193017277 165984647 T C 0.0081 0 0.613 199571406934815619801303686702426892954171102000409425009317586173629711759132743632343619870627981320256063608109903452715513572212491185332046423957413640313823842710929287814623383543019486815510138627343322014255627373009191126989038950002060088137249304762180566168009273002554121206876304756965376.00000 642.4 0.2786 ??????+ 0.0 1

Based on line counts for my filtered file, there are 10 SNPs with OR >= 1e300 in EUR.CD.gwas.assoc.gz (though I wouldn’t be surprised if there are additional SNPs with similarly extraordinary ORs somewhat < 1e300).

Cheers,
Raymond

