Calculating Chi-Square statistic for Linkage Disequilibrium test

450 views
Skip to first unread message

Lance Parsons

unread,
Oct 23, 2017, 4:40:15 PM10/23/17
to plink2...@googlegroups.com, Simon Forsberg

I’m working with some researchers who have been using PLINK to calculate LD (r2) for a large number of individuals and many SNP combinations rather successfully. However, for their specific case, they would like to calculate chi-square instead of the d-prime statistic.

An example of what they are doing for r2 calculations with PLINK:

plink --bfile data --r2 dprime inter-chr --ld-snp-list snplist.txt --threads 10 --ld-window-r2 0 --out ldResults

Using standard statistics tools makes this simple, but no where near fast enough (not close to PLINK’s speed). So, they asked me to help them. I thought the best place to start was to modify PLINK so it could calculate chi-square since a lot of the potentially slow data IO code is already handled.

Does anyone have any advice as to how to best go about implementing this (i.e. a place to start in the code base, etc.)? Any pointers or suggestions would be much appreciated. Of course, let me know if any further details or explanation would be helpful. I’m happy to contribute any code changes to the project if they are wanted.

-- 
Lance Parsons - Scientific Programmer
Carl C. Icahn Laboratory - Room 136
Lewis-Sigler Institute for Integrative Genomics
Princeton University

Christopher Chang

unread,
Oct 23, 2017, 5:05:45 PM10/23/17
to plink2-users
Calculation of the r/r^2 and D/D' statistics currently happens in lines 5188-5210 of plink_ld.c .  As long as you don't need both the chi-square stat and D/D' to be reported, it's straightforward to add another branch there to save the chi-square stat where D/D' would usually go; then there's a bit of command-line parsing code to add, another branch when writing the header line, etc., but it's still small enough that I'm happy to take care of this tonight.

Lance Parsons

unread,
Oct 24, 2017, 8:57:43 AM10/24/17
to plink2-users
Wow, that would be fantastic Christopher! At the moment, I don't believe there is any need to report both statistics. Please let us know if we can help out in any way (coding, testing, etc.).

Lance

Lance Parsons

unread,
Oct 24, 2017, 4:28:44 PM10/24/17
to plink2-users
Thanks for the pointers Chris. I know you said you'd work on it, but I thought I'd take a look at that code myself. So far, I was able to figure out at least parts of it. I can get the counts for the 3x3 contingency table, but there are other branches of the code that I'm unsure of, so I'm not sure how general a case I'd be able to code. Nonetheless, it's a start.

Another part that I'm unsure of the implementation of chi-square calculations. It seems there are some chi-square functions already in the codebase, but again, I'm not sure which are appropriate to this case. I could certainly calculate it "myself", but I'm guessing that you have some optimized code and methods you'd rather use.

Christopher Chang

unread,
Oct 24, 2017, 8:10:20 PM10/24/17
to plink2-users
I assume you're referring to the chi-square test p-value? (As opposed to the chi-square statistic, which I'd just calculate directly.) In plink 1.9, chi-square statistics should be saved to the main data array, and only converted to p-values by the thread-unsafe chiprob_p() function (requires '#include "plink_stats.h"') when the output file is written.

(plink 2.0 has a better thread-safe implementation of chiprob_p().)

Lance Parsons

unread,
Oct 26, 2017, 2:54:43 PM10/26/17
to plink2-users

Excellent, thanks for the pointers. I just put a very simple implementation up on my ld_chisquare branch of my fork. Right now it calculates chi-square instead of dprime and outputs the chi-square stat instead of the dprime stat in ldResults.ld. Also, if one snp is missing a genotype completely, you’ll get an nan value. Finally, there is no accounting for sex chromosomes in the code, which perhaps it should.

I’d love to get this to the state where it can be a Pull Request to PLINK, but could use a bit of help with command line parameters, handling of edge cases, and, of course, code style and optimization. Would you be willing/able to help out?

Also, I’m happy to move this discussion over to plink-dev if that would be more appropriate.

Christopher Chang

unread,
Oct 31, 2017, 2:25:50 AM10/31/17
to plink2-users
Sorry about the delay in responding; I'll try to fill in the other pieces (command-line parsing, etc.) later this week.

Lance Parsons

unread,
Oct 31, 2017, 11:03:35 AM10/31/17
to Christopher Chang, plink2-users
No problem at all. I appreciate the help! I believe I interpreted the data structure correctly, at least in some cases, but I'm not sure in all, there were some values in some arrays I wasn't sure about. (e.g. counts[9-17]).

One "issue" I realized I didn't consider was what to do when one allele simply doesn't occur for a particular SNP. In other words, the contingency table has one row or column that is all zeros (0). When that happens, the calculating the expected results in a divide by zero (which is handled gracefully) and generates a `nan` result for chi-square. The question is, what is the appropriate thing to do in such a case.

Again, thanks for the help. Let me know if you'd like me to start a PR or something (we can continue discussion there). Whatever makes it easier.

-- Lance

On Tue, Oct 31, 2017 at 2:25 AM Christopher Chang <chrch...@gmail.com> wrote:
Sorry about the delay in responding; I'll try to fill in the other pieces (command-line parsing, etc.) later this week.

--
You received this message because you are subscribed to a topic in the Google Groups "plink2-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/plink2-users/lME6ld4i4cQ/unsubscribe.
To unsubscribe from this group and all its topics, send an email to plink2-users...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
--
 - Lance Parsons

Christopher Chang

unread,
Nov 3, 2017, 10:08:00 PM11/3/17
to plink2-users
counts[9-17] is for male-only counts when at least one variant is on chrX.

Since you are performing the chi-square test on a 3x3 table, it should be a 4df test when no rows or columns are empty.  However, if there's an empty row and/or column, you compute a 2x2 or 2x3 chi-square statistic and perform a 1df or 2df test instead.  In order to do this properly with the plink 1.9 codebase, you'd need to track the correct df in some manner; lines 3719-3723 illustrate one way to do this (store it in the bottom two bits of the chi-square value).  Alternatively, you could just save "nan" in those cases, but that can be problematic when low-MAF variants are involved.

The plink 2.0 codebase is better-suited for this computation since the command-line interface makes it straightforward for the user to specify which of {usually-3x3 chi-square stat, df, p-value} they want to see in the report.  (It'll also support an observation-count column, which lets you convert r^2 into a 2x2 allelic chi-square statistic.)  So I'll plan on adding it there when --r/--r2 is implemented within the next few months, instead of bolting it onto 1.9.  Let me know if you have any more questions about getting your 1.9 fork to do what your team needs right now, though.

Lance Parsons

unread,
Nov 29, 2017, 5:03:00 PM11/29/17
to plink2-users

In case you are going to implement this in 2.0, I’ve updated my fork of where I hacked 1.9 to output chisquare and pvalues instead of r2 and dprime. In addition, I added --ld-window-cs as a chisquare pvalue threshold to keep the output size down. You can see it all here: https://github.com/lparsons/plink-ng/tree/ld_chisquare

Lance Parsons

unread,
Dec 21, 2017, 2:53:45 PM12/21/17
to plink2-users
So I've managed to figure out the case with the reduced table (thanks for the pointer on encoding the df in the lower bits). However, I've got a question about the sex chromosome cases. Basically, I have to cases:

1. If either chromosome is Y, I need to use only males. This seems straightforward by using counts[9-17]. Is there some variable(s) I can/should check to know that I'm in this case?

2. If either chromosome is X, I need to calculate two value values, one for only males and one for only females. How is the handled now in the code? How is the output encoded?

Thanks for any pointers, I appreciate it. Also, let me know if/when you are working on this code in the 2.0 branch. I'd love to get this "officially" supported.
Reply all
Reply to author
Forward
0 new messages