Hi Chris (and everyone!),
I'm using PLINK 1.9 to run an association test on a case-control data set comprising ~1 million SNPs derived from ~200 North American bat whole-exome sequences. The phenotype is binary: the populations consist of individuals that either died of a disease or survived it. The SNPs were filtered on read depth per site (10x per individual required), missingness (no more than 30% per group), and minor allele frequency (no less than 0.05 across the entire dataset).
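For reference, the missingness and MAF filters correspond roughly to the commands below; "data" is a placeholder fileset name, and the 10x depth filter was applied upstream (e.g., on the VCF), since PLINK binary files carry no per-genotype depth information. Note that --geno thresholds missingness across all samples, not per group, so the per-group filter was also handled separately.

```shell
# --geno 0.3 drops SNPs missing in more than 30% of samples;
# --maf 0.05 drops SNPs with minor allele frequency below 0.05.
plink --bfile data --geno 0.3 --maf 0.05 --make-bed --out data_filtered
```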
The goal is to identify SNPs that are associated with surviving the disease.
As a first pass, I ran a pair of tests using plink --assoc with both permutation options: plink --assoc perm (adaptive), then plink --assoc mperm=250000 (max(T)). With the adaptive permutation command, about 1,000 SNPs had an EMP1 below 0.05. With the max(T) option, many more SNPs had EMP1 <= 0.05 (about 75,000), yet none of these had a corrected p-value (EMP2) below 0.05; in fact, only 2 SNPs had values below 0.1.
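Concretely, the two runs looked roughly like this (fileset and output names are placeholders):

```shell
# Adaptive permutation: keeps permuting each SNP only until its
# significance is resolved; writes EMP1 to data_adaptive.assoc.perm.
plink --bfile data --assoc perm --out data_adaptive

# max(T) permutation: 250,000 label-swapping permutations; writes
# pointwise EMP1 and family-wise corrected EMP2 to data_maxt.assoc.mperm.
plink --bfile data --assoc mperm=250000 --out data_maxt
```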
It's my understanding from this thorough documentation that the value of the EMP1 field depends on the number of permutations applied to each SNP. For example, say I ran --assoc mperm=2000 and the first SNP in that dataset had an EMP1 value of 0.025. If it turned out (by chance) that the same SNP received exactly 2000 permutations when I ran --assoc perm, I should expect the same value in the EMP1 field, right? My understanding is that the max(T)-corrected values (EMP2) need no further correction for multiple testing, but the EMP1 values from the adaptive permutation command likely do not reflect the same stringency (apologies if there is a better word to use here). I'm guessing users typically proceed without further correction under the max(T) approach, but what do users do under the adaptive approach?
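If I have the arithmetic right, the empirical p-value in both modes is (R+1)/(N+1), where R is the number of permutations whose test statistic meets or beats the observed one and N is the number of permutations run for that SNP, so matching N and R would indeed give matching EMP1 values. A toy check:

```shell
# Empirical p-value as (R+1)/(N+1): with R=49 exceedances out of
# N=2000 permutations, EMP1 = 50/2001, i.e. about 0.025.
R=49; N=2000
awk -v r="$R" -v n="$N" 'BEGIN { printf "%.6f\n", (r+1)/(n+1) }'
# prints 0.024988
```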
In brief:
1. Is there anything I should be considering when running max(T) that might produce so few (or no!) significant p-values? Typically it's things like hidden population structure that inflate significance; in my case, I'm wondering whether something I'm not considering is deflating it. I've tried running 2k, 20k, and 50k permutations in addition to the 250k test, and I get the same SNPs with the same significance values each time. Perhaps I should tighten the missingness filter, or raise the MAF threshold?
2. Is there any value in using the adaptive permutation method if the max(T) report suggests that so few SNPs are likely significant? One option might be to run plink --clump on the .assoc output of the adaptive permutation run. If so, I'm still curious whether there are additional measures I should take to control for false positives.
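In case it helps the discussion, the clumping idea I had in mind looks something like this; the thresholds are placeholders I would still need to tune, and --clump reads the P column of the .assoc file rather than EMP1:

```shell
# Group association hits into LD-based clumps: index SNPs must pass
# --clump-p1, clumped SNPs --clump-p2, within --clump-kb of an index
# SNP and correlated at r^2 >= --clump-r2.
plink --bfile data --clump data_adaptive.assoc \
      --clump-p1 0.001 --clump-p2 0.05 \
      --clump-r2 0.5 --clump-kb 250 \
      --out data_clumped
```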
Finally, I have also explored using plink --logistic --covar to incorporate the top few principal components as covariates in a regression analysis. Notably, these PCs explain little variation (the first three eigenvalues are just 2.8, 2.0, and 1.7). That was expected, because this bat species is essentially panmictic east of the Rockies. I'm awaiting the regression output, but I was wondering whether the same multiple-testing considerations apply to this method and, if not, what users would recommend.
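For completeness, my regression setup is roughly the following (names are placeholders). I believe --logistic also accepts the perm/mperm= modifiers, which would keep the permutation framework comparable across methods:

```shell
# Compute the top principal components; writes data_pcs.eigenvec
# and data_pcs.eigenval.
plink --bfile data --pca 10 --out data_pcs

# Logistic regression with the first three PCs as covariates; --adjust
# writes a companion file of multiple-testing-corrected p-values.
plink --bfile data --logistic --covar data_pcs.eigenvec \
      --covar-number 1-3 --adjust --out data_logistic
```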
Thank you for your help; I look forward to your response.