Plink2 pca results might contain a bug?


Alex

Jun 20, 2018, 7:41:45 AM
to plink2-users
Hi there, thanks in advance for your help!

I am using plink2's --pca 10 command to calculate projections for a subset of the 1000 Genomes SNPs.
  1. The calculation takes about 10x as long in plink2 as in plink1.9 (the time goes into constructing the GRM). I guess plink2 no longer implements the EIGENSOFT algorithm?
  2. More critically, however, plink2 gives me very unexpected results (see below), with parts of the resulting dataset seemingly offset and rescaled.
Unexpected results in --pca in plink2

For reference, PC1 and PC2 of plink1.9 (this is what I see with other tools too, e.g. ldak, eigensoft, ...), with:
plink --bfile input --pca 10 header tabs --out plink19


PC1 and PC2 of plink2 (notice the very similarly shaped inset of data points), with:
plink2 --bfile input --pca 10 --out plink2


Looking at higher-order PCs of the plink2 projections, the PC1/PC3 projection looks reminiscent of the plink1.9 projection, but across all plots I get this weird offset, with parts of the data displaced and rescaled.




Christopher Chang

Jun 20, 2018, 8:45:38 AM
to plink2-users
Hi,

Can you post the full .log file from your run? If you can send a dataset I can use to replicate this, or describe how I’d create it from the initial 1000 Genomes dataset, that’d also be great.

(Meanwhile, it’s “--pca approx”, not plain PCA, which invokes the EIGENSOFT 6 fastmode algorithm. That doesn’t really pay off until you get to >5000 samples.)

Alex

Jun 20, 2018, 9:33:17 AM
to plink2-users
Hi Christopher,

Can you post the full .log file from your run?

Attached.

If you can send a dataset I can use to replicate this, or describe how I’d create it from the initial 1000 Genomes dataset, that’d also be great.

I can share, but I would prefer to send you a link to the dataset by email. It's not too large, but there are some sharing issues.
I'll write in a separate reply.

(Meanwhile, it’s “--pca approx”, not plain PCA, which invokes the EIGENSOFT 6 fastmode algorithm. That doesn’t really pay off until you get to >5000 samples.)

Is the plain PCA still available in plink2? Even if I call --pca (without approx) I get "Constructing GRM", which takes far longer than in plink1.9.
output.log

Christopher Chang

Jun 20, 2018, 11:20:58 AM
to plink2-users
Thanks for reporting this; looks like I broke PCA in the 30 May build.  Will try to post a fix within an hour.

Alex

Jun 20, 2018, 11:23:12 AM
to plink2-users
Great (not the breaking haha), thanks so much for the quick fix!
I'm happy I could help!

Christopher Chang

Jun 20, 2018, 12:14:59 PM
to plink2-users
Bugfix is now posted.  The bug also affected --score, so anyone who used the 30 May or an early June build for --score should also rerun with the newest build.

I will try to cover these commands with automated tests before the end of the week.  This wasn't done earlier since the straightforward testing strategy of checking for identical output files doesn't work here: PC signs depend on linear algebra library quirks, and "--pca approx" in particular is expected to have significant floating-point error.  But we're getting close enough to beta testing that there's no excuse to postpone this sort of basic quality assurance any longer.
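
For readers who want to check their own reruns, here is a minimal sketch of a sign-tolerant comparison between two sets of PCs. It is not the actual PLINK test code; the function name `pcs_match_up_to_sign`, the list-of-rows input layout, and the default tolerance are all assumptions made for illustration. Each PC column is allowed one global sign flip, and the values are then compared within a tolerance, which you would loosen considerably for "--pca approx".

```python
def pcs_match_up_to_sign(ref, test, tol=1e-6):
    """Compare two PC matrices (lists of per-sample rows) column by column.

    A column of `test` matches the same column of `ref` if the two agree
    within `tol` after an optional global sign flip of that column.
    Hypothetical helper for illustration; not part of PLINK.
    """
    if len(ref) != len(test) or not ref:
        return False
    n_pcs = len(ref[0])
    if any(len(row) != n_pcs for row in ref + test):
        return False

    for j in range(n_pcs):
        ref_col = [row[j] for row in ref]
        test_col = [row[j] for row in test]
        # The sign of the dot product tells us whether the column is flipped.
        dot = sum(a * b for a, b in zip(ref_col, test_col))
        sign = -1.0 if dot < 0 else 1.0
        if any(abs(a - sign * b) > tol for a, b in zip(ref_col, test_col)):
            return False
    return True


# Toy data only: three samples, two PCs, with PC1 sign-flipped in `flipped`.
reference = [[0.5, -0.1], [-0.5, 0.2], [0.0, -0.1]]
flipped = [[-0.5, -0.1], [0.5, 0.2], [0.0, -0.1]]
print(pcs_match_up_to_sign(reference, flipped))
```

A check like this passes when only the PC signs differ between builds or linear algebra libraries, but fails on the kind of offset and rescaling reported earlier in this thread.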