interpreting differences between mperm and perm results


Devon O'Rourke

Jun 10, 2020, 4:12:55 PM6/10/20
to plink2-users
Hi Chris (and everyone!),

I'm using PLINK 1.9 to run an association test on a case-control dataset comprising ~1 million SNPs derived from ~200 North American bat whole-exome sequences. The phenotype is binary: the populations consist of individuals that either died of a disease or survived. The SNPs were filtered for read depth per site (10x per individual required), missingness (no more than 30% per group), and minor allele frequency (at least 0.05 across the entire dataset).

The goal is to identify SNPs that are associated with surviving the disease.

As a first pass, I ran a pair of tests with plink --assoc using both permutation options: adaptive permutation (plink --assoc perm) first, then max(T) (plink --assoc mperm=250000). With the adaptive permutation command, about 1,000 SNPs came out with EMP1 less than 0.05. With the max(T) option, many more SNPs had EMP1 <= 0.05 (about 75,000), yet none of them had an adjusted p-value (EMP2) below 0.05 (in fact, only 2 SNPs had values below 0.1).

It's my understanding from this thorough documentation that the EMP1 value depends on the number of permutations applied to each SNP. For example, say I ran --assoc mperm=2000 and the first SNP in that dataset had an EMP1 value of 0.025. If it turns out (by chance) that the same SNP received exactly 2000 permutations when I ran --assoc perm, I should expect the same EMP1 value, right? It's my understanding that the max(T) values would not need further correction for multiple testing, but the EMP1 values from the adaptive permutation command likely do not reflect the same stringency (apologies if there is a better word to use here). I'm guessing users typically proceed without corrections for the max(T) approach, but what do users do for the adaptive approach?
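For what it's worth, my mental model of EMP1 is the standard pointwise empirical p-value, which can be sketched in a few lines (a toy stand-in, not PLINK's actual code; PLINK permutes case/control labels and, as I understand it, uses the (R+1)/(N+1) estimator):

```python
import random

random.seed(1)

def emp1(observed_stat, perm_stats):
    """Pointwise empirical p-value, (R + 1) / (N + 1): R is the number of
    permutations whose test statistic is at least as extreme as the
    observed one, N the number of permutations run for this SNP."""
    r = sum(s >= observed_stat for s in perm_stats)
    return (r + 1) / (len(perm_stats) + 1)

# Toy example: one SNP's observed statistic versus 2000 permuted
# statistics (simulated null draws stand in for the statistics PLINK
# would recompute after shuffling phenotype labels).
observed = 6.0
perm_stats = [random.expovariate(0.5) for _ in range(2000)]
print(emp1(observed, perm_stats))
```

Under this estimator, a SNP that happened to receive exactly 2000 permutations under --assoc perm would indeed get the same EMP1 as under mperm=2000, and the resolution of EMP1 is limited to 1/(N+1).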

In brief:
1. Is there anything I should be considering when running max(T) that might produce so few (or no!) significant p-values? Typically it's things like hidden population structure that inflate test statistics; in my case, I'm wondering whether something I'm not considering is deflating them. I've tried running 2k, 20k, and 50k permutations in addition to the 250k test, and I get the same SNPs with the same significance values each time. Should I change the missingness threshold, or raise the MAF filter?

2. Is there any value in using the adaptive permutation method if the max(T) report suggests that so few SNPs are likely significant? One option might be to run plink --clump on the .assoc output from the adaptive permutation run. If so, I'm still curious whether there are additional measures I should take to control for false positives.

Finally, I have also explored plink --logistic --covar to incorporate the top few PCs as covariates in a regression analysis. Notably, these PCs explain little variation (the first three eigenvalues are just 2.8, 2.0, and 1.7). That was expected, because this bat species is essentially panmictic east of the Rockies. I'm awaiting the regression output, but I'm wondering whether the same multiple-testing considerations apply to this method, and if not, what recommendations users might have.
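For context, the model I understand --logistic --covar to be fitting is ordinary logistic regression on genotype plus covariates. A minimal stdlib-only sketch (a toy stand-in with made-up data; PLINK's own implementation uses Newton-type iterations and reports per-SNP Wald statistics):

```python
import math
import random

random.seed(3)

def logistic_fit(X, y, lr=0.1, n_iter=2000):
    """Plain gradient-descent logistic regression. Each row of X is
    [1, genotype, PC1, ...]; y is 0/1 case status."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [0.0] * p
        for xi, yi in zip(X, y):
            z = sum(b * x for b, x in zip(beta, xi))
            pred = 1.0 / (1.0 + math.exp(-z))
            for j in range(p):
                grad[j] += (pred - yi) * xi[j]
        beta = [b - lr * g / n for b, g in zip(beta, grad)]
    return beta

# Toy data: genotype plus one PC covariate; phenotype driven mostly
# by genotype (true genotype effect = 1.2 on the log-odds scale).
X, y = [], []
for _ in range(200):
    geno = random.choice([0, 1, 2])   # minor-allele count
    pc1 = random.gauss(0, 1)          # ancestry covariate
    z = -1.0 + 1.2 * geno + 0.1 * pc1
    y.append(1 if random.random() < 1 / (1 + math.exp(-z)) else 0)
    X.append([1.0, geno, pc1])

beta = logistic_fit(X, y)
print([round(b, 2) for b in beta])  # intercept, genotype beta, PC1 beta
```

The fitted genotype coefficient is a per-allele log-odds effect; including PCs in X is what adjusts each SNP's test for ancestry.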

Thank you for your help, I look forward to your response.

Christopher Chang

Jun 10, 2020, 5:04:09 PM6/10/20
to plink2-users
1. With only ~200 samples, you don't have great statistical power.  If there's a SNP with a consistent large effect, you can probably still find it, but otherwise this is not a surprising outcome.  (The SNPs which already have a max(T) p-value of 0.1 are pretty likely to pan out once you add more samples, of course.)

2. Adaptive permutation rarely adds much value over plain --linear/--logistic.  Correction for top PCs is far more likely to be relevant.

3. The main drawback of --logistic is that full-blown max(T) permutation is much more expensive.  But with only 200 samples, a few hundred permutations should be manageable, and it sounds like that's enough for your purposes.
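To make the max(T)/EMP2 distinction concrete, here is a toy sketch (not PLINK's code): each permutation applies the same phenotype shuffle to all SNPs, and each SNP's observed statistic is compared against the distribution of the per-permutation maximum, which is what makes EMP2 family-wise corrected:

```python
import random

random.seed(7)

def maxt_emp2(observed, perm_matrix):
    """Family-wise (max(T)) empirical p-values. For each permutation,
    take the maximum statistic across all SNPs; a SNP's EMP2 is the
    fraction of permutations whose maximum meets or exceeds its
    observed statistic, with the usual +1 correction."""
    n_perm = len(perm_matrix)
    perm_max = [max(row) for row in perm_matrix]
    return [
        (sum(m >= obs for m in perm_max) + 1) / (n_perm + 1)
        for obs in observed
    ]

# Toy data: 5 SNPs, 1000 permutations of simulated null statistics.
n_snps, n_perm = 5, 1000
observed = [1.2, 8.5, 0.7, 3.1, 2.4]
perm_matrix = [[random.expovariate(1.0) for _ in range(n_snps)]
               for _ in range(n_perm)]

for snp, p in enumerate(maxt_emp2(observed, perm_matrix)):
    print(f"SNP{snp}: EMP2 = {p:.4f}")
```

Because every SNP is compared against the same max distribution, an EMP2 below 0.05 already controls the family-wise error rate and needs no further Bonferroni-style step, whereas adaptive-permutation EMP1 values are pointwise and still do.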

Devon O'Rourke

Jun 10, 2020, 6:13:08 PM6/10/20
to plink2-users
Thanks for the help, Chris.
Two brief clarification questions:

1. Would you recommend any particular correction approach following the adaptive permutation method? Would this entail --clump, or something else?

2. With respect to the max(T) correction, is there a standard way to identify an appropriate number of permutations for a given dataset? For example, might one initially run the adaptive method, filter by some p-value threshold (say, retain only SNPs with EMP1 < 0.001), and then use the median number of permutations those SNPs required?

Thanks for the insights, the terrific software, and the extensive documentation. It's a rare combo, and much appreciated.

Elielson Veloso

May 30, 2022, 6:06:14 PM5/30/22
to plink2-users
"It's my understanding that the max(T) values would not need further correction for multiple testing"

Hello, Devon! I have the same question regarding the EMP1 and EMP2 values. Can I treat the EMP1 p-value as already corrected for multiple testing?

Thanks!