Can PLINK work on dosage data for LD pruning?

408 views
Skip to first unread message

Shuo Yang

unread,
Jul 21, 2015, 3:26:28 PM7/21/15
to plink2-users
I'm not very familiar with the software, but I need to do some LD pruning for imputed data, best on dosage data. Can I do that? Or if not, does that matter very much if the pruning works on the normal imputed genotype data and then SNP removal works on dosage data according to the results from pruning?


Thanks,
Shuo

Christopher Chang

unread,
Jul 21, 2015, 3:55:54 PM7/21/15
to plink2-users, sdmor...@gmail.com
Not yet, only a few PLINK operations currently work on dosage data and LD pruning is not one of them.  However, if your data was generated by IMPUTE2, PLINK can convert the data to hard calls (you can control how this conversion occurs with the --hard-call-threshold flag) and then produce good LD pruning results.

Shuo Yang

unread,
Jul 21, 2015, 3:59:04 PM7/21/15
to plink2-users
Thanks for the quick answer Christopher. That sounds very good.

Shuo

Shuo Yang

unread,
Jul 21, 2015, 4:17:42 PM7/21/15
to plink2-users
By the way, can I ask why LD pruning is not supported for dosage data? As I feel the dosage data is kind of the best for association study, and LD pruning is also necessary for association study, so LD pruning on dosage data should be a very important feature of the software. Am I correct? Or it's just a issue of the time and it will be supported in the near future? Or due to some technical or scientific problems it will not?


Thanks,
Shuo

On Tuesday, July 21, 2015 at 3:55:54 PM UTC-4, Christopher Chang wrote:

Christopher Chang

unread,
Jul 21, 2015, 4:35:22 PM7/21/15
to plink2-users, sdmor...@gmail.com
Dosage data is not supported by the PLINK 1 file format.  All of PLINK's dosage data commands are in a small module disconnected from the rest of the program.

This will be changed in PLINK 2.0, since yes, dosage data is a lot more common now than it was in 2007.

Shuo Yang

unread,
Jul 23, 2015, 11:38:23 AM7/23/15
to plink2-users, chrch...@gmail.com
Hey Christopher,

Finally I decided to prune the SNPs on dosage data by myself. However, I don't quite understand the pruning algorithm used in PLINK, specifically for the "moving the window" part. Let's say, if we move the window 5 SNPs each step, there are two possibilities: 1. the front part of the SNPs in the new window are from the last window, which means they have already been greedily checked and pruned, and there is no meaning to further check and prune them (redundant); 2. if the number of left SNPs in the last window is less than 5, then after we move forward, we jump and ignore some SNPs between the two windows, which we don't know whether we should prune them or not (non-comprehensive). Is there special meaning or purpose for "moving the window every x SNPs in each step"? I'm trying to understand how the algorithm greatly make sense. Or maybe we can just greedily check and prune all SNPs in order, ignoring this sliding window part, right?

Thanks,
Shuo

Christopher Chang

unread,
Jul 24, 2015, 12:36:41 AM7/24/15
to plink2-users, sdmor...@gmail.com, sdmor...@gmail.com
Suppose our initial window is of size 500.  Label the SNPs s_1, s_2, ..., s_{500}; --indep-pairwise performs the following operations on them:

removed_indices = {};
for (snp_index_1 := 1; snp_index_1 <= 499; snp_index_1 += 1) {
  if (snp_index_1 in removed_indices) {
    continue;
  }
  for (snp_index_2 := snp_index_1 + 1; snp_index_2 <= 500; snp_index_2 += 1) {
    r2 = compute_correlation(s_{snp_index_1}, s_{snp_index_2});
    if (r2 > max_allowed_r2) {
      if (MAF(s_{snp_index_1}) < MAF(s_{snp_index_2})) { // prune the SNP with lower minor allele frequency
        removed_indices.add(snp_index_1);
        break;
      } else {
        removed_indices.add(snp_index_2);
      }
    }
  }
}

Suppose every even-numbered SNP remains after this loop completes, and the window step size is 50.  Then we advance the end of the window from s_{500} to s_{550}.  s_1, s_3, ..., s_{499} are definitely out.  s_2, s_4, ..., s_{50} are definitely still in.  But s_{52} will still be checked against s_{501} to s_{550}.  And then s_{54} will still be checked against the subset of s_{501}..s_{550} that doesn't get knocked out by s_{52}.  Etcetera.

Nick Mancuso

unread,
May 8, 2017, 2:26:03 AM5/8/17
to plink2-users, sdmor...@gmail.com
Hello,

Thank you for the awesome new release of PLINK2. The "indep-pairwise" flag is implemented, but does it operate on the dosage values or the hardcalls?

If I use plink --vcf my_calls.vcf dosage=DS --indep-pairwise 250kb 0.3 

how does it run? I realize that documentation is still a work in progress while the tool is being developed. 

Cheers and thank you again,

Nick

Christopher Chang

unread,
May 8, 2017, 2:30:35 AM5/8/17
to plink2-users, sdmor...@gmail.com
Just the hardcalls.  (You can use --hard-call-threshold to customize when hardcalls are saved, though.)
Reply all
Reply to author
Forward
0 new messages