Harmonize EIGENSTRAT results with flashpca

Vince Forgetta

Apr 4, 2017, 2:04:37 PM

to flashpca-users

Hi,

Thanks for developing flashpca. The time I have saved on just the few runs I have done so far is incredible.

My issue is reconciling the number of markers recommended by EIGENSTRAT vs flashpca.

EIGENSTRAT recommends using at least ~100,000 markers as input. See:

https://github.com/DReichLab/EIG/blob/master/EIGENSTRAT/README

Specifically this paragraph:

"RUNNING EIGENSTRAT WITH LESS THAN 100,000 MARKERS

The EIGENSTRAT method is primarily aimed at genome-scan data sets with
at least 100,000 markers. If running EIGENSTRAT on much smaller data sets,
the inclusion of a candidate marker in the set of markers used to infer
principal components may lead to a loss in power (Price et al. 2006 Supp Note)."

In contrast, the flashpca example on the GitHub page prunes down to considerably fewer SNPs (--indep-pairwise 1000 50 0.05), although you clearly state that this number of SNPs is acceptable.

I have performed a PC analysis using EIGENSTRAT on ~100,000 pruned SNPs (--exclude range high-LD-regions-hg19.txt --indep-pairwise 50 5 0.2). I would like to confirm that flashpca can recapitulate these results. When I run flashpca on the same PLINK binary PED input file of ~100,000 markers, PC1 is highly correlated with the EIGENSTRAT result (0.93, albeit not 1), but the remaining PCs show no correlation.
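A comparison of this kind can be sketched in Python as below. This is only an illustration, assuming the PCs from each program have already been loaded into NumPy arrays with samples in the same row order; the function name `pc_correlations` and the toy data are hypothetical and not part of either tool. Because the sign of a PC is arbitrary, the sketch reports the absolute correlation.

```python
import numpy as np

def pc_correlations(pcs_a, pcs_b):
    """Absolute Pearson correlation between matching PC columns.

    pcs_a, pcs_b: arrays of shape (n_samples, n_pcs), with rows in the
    same sample order. PC signs are arbitrary, so the absolute value
    of each correlation is returned.
    """
    pcs_a = np.asarray(pcs_a, dtype=float)
    pcs_b = np.asarray(pcs_b, dtype=float)
    if pcs_a.shape != pcs_b.shape:
        raise ValueError("PC matrices must have the same shape")
    return np.array([
        abs(np.corrcoef(pcs_a[:, k], pcs_b[:, k])[0, 1])
        for k in range(pcs_a.shape[1])
    ])

# Toy data (made up): the second set is a sign-flipped, rescaled copy
# of the first with a little noise added.
rng = np.random.default_rng(0)
a = rng.normal(size=(200, 3))
b = -2.0 * a + rng.normal(scale=0.01, size=a.shape)
print(pc_correlations(a, b))
```

Values near 1 for every column would indicate that the two programs agree up to sign and scale; a drop to near 0 after PC1, as described above, would point to a real difference in the analysis.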

So my question is:

What is the best approach to harmonize results from a past analysis with EIGENSTRAT using 100,000 markers with any current attempts I want to perform using flashpca? I want to ensure that whatever results I am getting from flashpca are "valid" in that they agree with my past results using EIGENSTRAT (assuming those are valid!).

Thanks,

Vince

Vince Forgetta

Apr 4, 2017, 2:09:04 PM

to flashpca-users

Oops, I forgot to mention the required information:

$ uname -a
Linux d1p-hydrars03.ldi.lan 3.10.0-229.14.1.el7.x86_64 #1 SMP Tue Sep 15 15:05:51 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/centos-release
CentOS Linux release 7.2.1511 (Core)

$ flashpca --version
[Tue Apr 4 14:08:10 2017] arguments: flashpca flashpca --version
flashpca 2.0
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Gad Abraham

Apr 4, 2017, 9:43:48 PM

to Vince Forgetta, flashpca-users

Hi Vince,

Did you run EIGENSOFT/smartpca on 100,000 SNPs and then FlashPCA on fewer
SNPs, or did you run both on the same data?

Gad


Vince Forgetta

Apr 6, 2017, 8:26:07 AM

to flashpca-users

In discussion with Gad via email, we found that outlier removal in EIGENSTRAT (smartpca) was the cause of the discrepancy.

Passing '-m 0' to smartpca.pl, or setting 'numoutlieriter: 0' in the smartpca parameter file (*.par), resolved the issue.
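For anyone generating the parameter file from a script, a minimal Python sketch follows. The helper name `smartpca_par` and all of the `example.*` file names are hypothetical placeholders; only the `numoutlieriter: 0` setting comes from this thread, and the remaining keys are the usual smartpca input and output parameters, so check them against the EIGENSOFT documentation for your version.

```python
def smartpca_par(genotype, snp, indiv, evec_out, eval_out, num_pcs=10):
    """Render the text of a smartpca parameter file with outlier removal off."""
    lines = [
        f"genotypename: {genotype}",
        f"snpname: {snp}",
        f"indivname: {indiv}",
        f"evecoutname: {evec_out}",
        f"evaloutname: {eval_out}",
        f"numoutevec: {num_pcs}",
        "numoutlieriter: 0",  # zero outlier-removal iterations, as in the thread
    ]
    return "\n".join(lines) + "\n"

# Hypothetical file names, for illustration only.
par_text = smartpca_par(
    "example.geno", "example.snp", "example.ind",
    "example.evec", "example.eval",
)
print(par_text)
```

The resulting text would be saved to a *.par file and passed to smartpca in the usual way.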

In addition, using fastmode (fastmode: YES) also disables outlier removal, and it produced results that were highly correlated with those from flashpca.

Moreover, as further confirmation that the results are highly concordant, the outliers identified were identical between the two programs (samples more than 6 standard deviations from the mean on one or more PCs).
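The outlier check described above can be sketched in Python as follows. This is a single-pass illustration, not a reimplementation of smartpca's iterative outlier removal; the function name `flag_outliers` and the toy data are hypothetical, and the PCs are assumed to be NumPy arrays with one row per sample.

```python
import numpy as np

def flag_outliers(pcs, n_sd=6.0):
    """Row indices of samples more than n_sd SDs from the mean on any PC."""
    pcs = np.asarray(pcs, dtype=float)
    # Standardize each PC column, then flag rows with any extreme value.
    z = (pcs - pcs.mean(axis=0)) / pcs.std(axis=0, ddof=1)
    return set(np.flatnonzero((np.abs(z) > n_sd).any(axis=1)).tolist())

# Toy data (made up): one planted extreme sample at row 7.
rng = np.random.default_rng(1)
pcs_a = rng.normal(size=(500, 2))
pcs_a[7, 0] = 40.0
pcs_b = -pcs_a  # same PCs with flipped signs, as another program might report

# The two sets of flagged samples can then be compared directly.
print(flag_outliers(pcs_a) == flag_outliers(pcs_b))
```

Because the test uses absolute deviations from the mean, it is unaffected by the arbitrary sign of each PC.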

Finally, the following sentence from the flashpca paper (doi:10.1371/journal.pone.0093766) clearly states that outliers were not removed during the comparison:

"smartpca was run without excluding potential population outliers."
