PCA projection bias (a.k.a. shrinkage)


Davidski

Jun 20, 2014, 6:12:35 AM
to plink2...@googlegroups.com
I recently discovered that the PCA within Plink 1.9 suffers from projection bias or shrinkage. Here's an example using the ancient La Brana-1 genome.

[Plot: projected]

[Plot: not projected]

This problem also affects Eigenstrat, and is most pronounced when using a fairly small number of reference samples (under 5,000). Here's a technical discussion of the problem.


I have no idea whether this is fixable, but I thought I'd mention it. By the way, yes, I'd definitely like to see a more sophisticated IBD estimation option within Plink 1.9, especially one that takes into account population structure and admixture.

Cheers

James Lee

Jun 20, 2014, 1:31:53 PM
to plink2...@googlegroups.com

It is well known that the PCA solution depends on the sample sizes of the respective populations, and I would guess that the transition from N=0 to N=1 can be particularly disruptive.

http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1000686

The different appearance of the PC plots will not go away in this case even if the respective sample sizes in each of the two analyses are multiplied by a very large constant.

It seems to us that the Lee et al. paper addresses a qualitatively different problem. Suppose that the meta-population is fixed -- which means that the relative sizes of the constituent populations are fixed, and that the asymptotic PCA solution is fixed. Then the projection of a new individual from any of the constituent populations onto a solution derived from a finite sample of the meta-population is biased toward the origin. However, for a fixed number of SNPs, this problem does go away as the size of the meta-population sample used for training becomes large.
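The bias toward the origin described above can be reproduced with a small simulation. This is only a sketch under simplifying assumptions: Gaussian noise stands in for genotypes, two equally sized populations differ along a single marker, and the function name `simulate` and its parameters are made up for illustration. It is not how PLINK or Eigenstrat compute PCA.

```python
import numpy as np


def simulate(n_train, n_markers, n_test=200, offset=6.0, seed=0):
    """Return mean |PC1 score| of projected test samples divided by
    mean |PC1 score| of the training samples (1.0 means no shrinkage)."""
    rng = np.random.default_rng(seed)

    def draw(n):
        # Two populations with means at +offset and -offset on marker 0;
        # all other markers are pure noise (an assumption, not real genotypes).
        labels = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)
        data = rng.standard_normal((n, n_markers))
        data[:, 0] += labels * offset
        return data

    train, test = draw(n_train), draw(n_test)
    center = train.mean(axis=0)
    # Rows of vt are the principal axes of the centered training data.
    _, _, vt = np.linalg.svd(train - center, full_matrices=False)
    pc1 = vt[0]
    train_scores = (train - center) @ pc1  # scores of samples used to fit the PCA
    test_scores = (test - center) @ pc1    # scores of new samples projected onto PC1
    return np.abs(test_scores).mean() / np.abs(train_scores).mean()


if __name__ == "__main__":
    print("100 samples, 5000 markers:", simulate(100, 5000))
    print("2000 samples, 200 markers:", simulate(2000, 200))
```

With far more markers than training samples the ratio falls well below 1, meaning the projected samples sit closer to the origin than their own populations do. With many samples and few markers the ratio comes back close to 1, which matches the point that the problem fades as the training sample grows for a fixed number of SNPs.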

We will study the Lee et al. paper and consider whether its suggested adjustment is worth implementing in PLINK. However, we note that this adjustment may not bring your two plots into better agreement.

Others may chime in if they disagree with this diagnosis ...

Davidski

Jun 20, 2014, 10:31:56 PM
to plink2...@googlegroups.com
Hi,

The problem on my plots is the same one as discussed in this paper...


Note that the authors implemented the solution from Lee et al. by calculating the shrinkage factor and compensating for it.

Basically, if you run a plot of Europe and project, say, a Russian onto that plot, the Russian will not cluster with Russians but with Germans, because his projected result will be biased towards 0. Ideally though, he should cluster with Russians, like he does when not projected.

As far as I know, the problem is caused by using a much higher number of markers than samples, and can be reduced by increasing the number of samples. But using the same number of samples as markers isn't practical these days, since many analyses are run with well over 100,000 SNPs.
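To illustrate what "calculating the shrinkage factor and compensating for it" could look like in practice, here is a rough sketch. Note the assumptions: this is a generic hold-out estimate, not the estimator from Lee et al. and not a PLINK or Eigenstrat feature; the data are simulated Gaussian noise rather than genotypes; and the names `fit_pc1`, `estimate_shrinkage`, `draw` and `demo` are invented for the example.

```python
import numpy as np


def fit_pc1(data):
    """Return the centering vector and first principal axis of `data`."""
    center = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - center, full_matrices=False)
    return center, vt[0]


def estimate_shrinkage(train, n_folds=5, fold_size=10, seed=0):
    """Hold out small groups of training samples, refit the PCA without them,
    project them, and compare with their in-sample PC1 scores."""
    rng = np.random.default_rng(seed)
    center, pc1 = fit_pc1(train)
    in_sample = np.abs((train - center) @ pc1)
    order = rng.permutation(len(train))
    projected, reference = [], []
    for k in range(n_folds):
        held = order[k * fold_size:(k + 1) * fold_size]
        c, axis = fit_pc1(np.delete(train, held, axis=0))
        # Absolute values sidestep the arbitrary sign of a principal axis.
        projected.append(np.abs((train[held] - c) @ axis))
        reference.append(in_sample[held])
    return np.concatenate(projected).mean() / np.concatenate(reference).mean()


def draw(rng, n, n_markers, offset=6.0):
    # Simulated stand-in for genotypes: two populations split along marker 0.
    labels = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)
    data = rng.standard_normal((n, n_markers))
    data[:, 0] += labels * offset
    return data


def demo(seed=1):
    rng = np.random.default_rng(seed)
    train, test = draw(rng, 100, 5000), draw(rng, 200, 5000)
    center, pc1 = fit_pc1(train)
    train_scale = np.abs((train - center) @ pc1).mean()
    raw = (test - center) @ pc1
    factor = estimate_shrinkage(train)
    corrected = raw / factor  # compensate by rescaling the projected scores
    return (np.abs(raw).mean() / train_scale,
            np.abs(corrected).mean() / train_scale,
            factor)


if __name__ == "__main__":
    raw_ratio, corrected_ratio, factor = demo()
    print("estimated factor:", factor)
    print("raw vs training scale:", raw_ratio)
    print("corrected vs training scale:", corrected_ratio)
```

Because each refit uses slightly fewer samples than the full reference set, the estimated factor is probably a touch too small, so the rescaled scores may overshoot a little. The sketch also corrects only the first component; each further component would need its own factor.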