I was wondering if I could get your thoughts on an approach. I also realized that our past two exchanges were private and not visible to the public, so I will first summarize what we determined earlier.
As alluded to in the "Efficient Variant Set Mixed
Model Association Tests for Continuous and Binary Traits in Large-Scale
Whole-Genome Sequencing Studies" paper published in AJHG, a user should know that if they are analyzing a gene
with a majority of singletons/ultra rare variants with variance component based tests, such as SKAT, they should not expect a signal because the based SKAT
methods are calculating SNP specific statistics and then aggregating
them In this case, the Burden test is
preferred.
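For concreteness, this is the standard SKAT-style aggregation I am referring to (generic textbook notation, not lifted from the paper):

$$ Q_{\mathrm{SKAT}} = \sum_{j=1}^{m} w_j^{2} S_j^{2}, \qquad S_j = \sum_{i=1}^{n} G_{ij}\,(y_i - \hat{\mu}_i) $$

where $G_{ij}$ is the genotype of person $i$ at variant $j$, $y_i$ the case/control status, and $\hat{\mu}_i$ the fitted null probability. Each variant's score is squared on its own before being summed, so evidence is never pooled across variants. For a singleton, $S_j$ contains exactly one nonzero term, so $|S_j| < 1$ for a binary trait no matter how strong the variant's effect.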
In more detail: the SKAT methods are variance-component tests that, for a binary trait, effectively compare each variant's carriers against a random assignment of case/control status (a coin flip per carrier). A singleton has only two possible configurations: its single carrier is either a case or a control. Therefore, for a collection of singletons in a gene, each individual per-variant score statistic is necessarily weak and, in turn, the aggregate score used for the p-value calculation is also weak. This is regardless of how many singletons are present in the gene.
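To make that concrete, here is a minimal toy sketch (entirely my own illustration with simulated genotypes, unit weights, no covariates, and a perfectly balanced case/control split; not code from SKAT or the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 2000, 30                                # samples, singleton variants
y = np.r_[np.ones(n // 2), np.zeros(n // 2)]   # 1 = case, 0 = control
mu = np.full(n, y.mean())                      # fitted null mean (no covariates)

def skat_q(carrier_idx):
    """SKAT-style statistic with unit weights for a gene of m singletons:
    Q = sum_j S_j^2, where S_j = sum_i G_ij * (y_i - mu_i)."""
    G = np.zeros((n, m))
    G[carrier_idx, np.arange(m)] = 1           # one carrier per variant
    S = G.T @ (y - mu)                         # per-variant score statistics
    return np.sum(S ** 2)

q_alt  = skat_q(rng.choice(n // 2, size=m, replace=False))  # carriers ALL cases
q_null = skat_q(rng.choice(n, size=m, replace=False))       # carriers random
print(q_alt, q_null)   # both m * 0.25 = 7.5: Q cannot distinguish the scenarios
```

Each singleton's squared score is pinned at $\hat{\mu}(1-\hat{\mu}) = 0.25$ here whether its carrier is a case or a control, so Q comes out identical under maximal case enrichment and under the null.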
Burden tests, on the other hand, are mean-based: they aggregate the total number of rare alleles carried per person and test whether that mean count differs between cases and controls. They would therefore be the more appropriate test when looking at a gene or genes with an overwhelming majority of ultra-rare or singleton variants.
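And the burden contrast under the same toy assumptions (again my own sketch, unit weights, balanced design): the per-variant scores are summed before squaring, so case-enriched singletons reinforce one another instead of vanishing:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 2000, 30                                # same toy setup as above
y = np.r_[np.ones(n // 2), np.zeros(n // 2)]
mu = np.full(n, y.mean())

def burden_stat(carrier_idx):
    """Burden-style statistic with unit weights: the score for the per-person
    carrier count b_i = sum_j G_ij, i.e. scores summed BEFORE squaring."""
    G = np.zeros((n, m))
    G[carrier_idx, np.arange(m)] = 1
    b = G.sum(axis=1)                          # burden: carrier count per person
    return float(b @ (y - mu)) ** 2

print(burden_stat(rng.choice(n // 2, size=m, replace=False)))  # all-case carriers: (0.5 * m)^2 = 225
print(burden_stat(rng.choice(n, size=m, replace=False)))       # random carriers: ~m / 4 on average
```

With every carrier a case, the burden statistic is 30x its null expectation, while the SKAT statistic above did not move at all.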
New Question:
Given the behavior I described above, I was wondering your thoughts on EXCLUDING singletons (and possibly even doubletons) when running analyses with variance-component tests such as SKAT, SKAT-Binary, and SKAT-Robust. The thought is that these singletons contribute essentially no per-variant signal and so could dilute the aggregate statistic, artificially INFLATING gene-level p-values for genes where the input variants are majority ultra-rare (i.e., singletons and doubletons).
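For reference, here is the kind of pre-filter I have in mind, as a minimal Python sketch; drop_ultra_rare and mac_threshold are hypothetical names of my own, and I am assuming a complete 0/1/2 dosage matrix with samples in rows:

```python
import numpy as np

def drop_ultra_rare(G, mac_threshold=1):
    """Drop variants with minor allele count <= mac_threshold before running a
    variance-component test (threshold 1 drops singletons; 2 also doubletons).
    G: n_samples x n_variants genotype dosage matrix (0/1/2), no missingness."""
    ac = G.sum(axis=0)                            # alternate-allele count per variant
    mac = np.minimum(ac, 2 * G.shape[0] - ac)     # fold to the MINOR allele count
    keep = mac > mac_threshold
    return G[:, keep], keep
```

The retained variants would feed SKAT as usual; the excluded singletons/doubletons could instead be routed into a companion Burden test, in line with the reasoning above, rather than being discarded outright.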
Sincerely,
JFK