Hi,
Thanks for developing flashpca. The time I have saved on just the few runs I have done so far is incredible.
My issue is reconciling the number of markers recommended by EIGENSTRAT vs flashpca.
EIGENSTRAT recommends that the number of markers input is ~100,000. See:
Specifically this paragraph:
"RUNNING EIGENSTRAT WITH LESS THAN 100,000 MARKERS
The EIGENSTRAT method is primarily aimed at genome-scan data sets with
at least 100,000 markers. If running EIGENSTRAT on much smaller data sets,
the inclusion of a candidate marker in the set of markers used to infer
principal components may lead to a loss in power (Price et al. 2006 Supp Note)."
In contrast, the flashpca example on the GitHub page prunes down to considerably fewer SNPs (--indep-pairwise 1000 50 0.05), although you clearly state that this number of SNPs is acceptable.
I have performed a PC analysis using EIGENSTRAT on ~100,000 pruned SNPs (--exclude range high-LD-regions-hg19.txt --indep-pairwise 50 5 0.2), and I would like to confirm that flashpca can recapitulate these results. When I run flashpca on the same PLINK binary ped input file of ~100,000 markers, PC1 is highly correlated between the two tools (0.93, though not 1), but the remaining PCs show no correlation.
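To make the comparison concrete, here is a minimal sketch of one way to compare two sets of PCs, assuming each tool's output has already been loaded into a samples-by-PCs array with samples in the same order. The function name and the synthetic data are hypothetical and only for illustration; no real output format from either tool is assumed. It compares every PC from one tool against every PC from the other, and takes the absolute value because the sign of a principal component is arbitrary.

```python
import numpy as np


def pc_cross_correlation(pcs_a, pcs_b):
    """Absolute Pearson correlation between each column of pcs_a and each column of pcs_b.

    Rows are samples (same order in both arrays); columns are PCs.
    Entry [i, j] compares PC i+1 of the first tool with PC j+1 of the second.
    """
    a = np.asarray(pcs_a, dtype=float)
    b = np.asarray(pcs_b, dtype=float)
    if a.shape[0] != b.shape[0]:
        raise ValueError("both arrays must contain the same samples")
    # Standardise each column so the scaled dot product is a Pearson correlation.
    a = (a - a.mean(axis=0)) / a.std(axis=0)
    b = (b - b.mean(axis=0)) / b.std(axis=0)
    # Absolute value: a PC and its negation describe the same axis.
    return np.abs(a.T @ b) / a.shape[0]


# Synthetic stand-in for real output (hypothetical data, not real results).
rng = np.random.default_rng(0)
pcs_tool_a = rng.normal(size=(200, 3))
# Same components, but with PC1 sign-flipped and PC2/PC3 swapped.
pcs_tool_b = pcs_tool_a[:, [0, 2, 1]] * np.array([-1.0, 1.0, 1.0])

corr = pc_cross_correlation(pcs_tool_a, pcs_tool_b)
print(np.round(corr, 2))
```

Looking at the full matrix rather than only its diagonal shows whether a PC from one tool matches a differently numbered PC from the other, which a PC-by-PC comparison alone would miss.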
So my question is:
What is the best approach to harmonize results from my past analysis with EIGENSTRAT using 100,000 markers with any current analyses I want to perform using flashpca? I want to ensure that whatever results I am getting from flashpca are "valid" in that they agree with my past results using EIGENSTRAT (assuming those are valid!).
Thanks,
Vince