PCA vs PCoA

Mandi D'Ombrain

Nov 9, 2021, 7:32:18 PM

to dartR

Hi,

Hoping someone can clear up my PCA / PCoA confusion. I have read the documentation for 'gl.pcoa' https://www.rdocumentation.org/packages/dartR/versions/1.9.9.1/topics/gl.pcoa (although I have seen posts about bugs and am not sure which dartR version I should be using).

My understanding is that although the function is called ‘gl.pcoa’, it is a wrapper for two separate functions:

If you give it a genlight object, it will use the function ‘glPca’ from the ‘adegenet’ package and perform a Pearson Principal Component Analysis (PCA), using a Euclidean distance.
If you give it a distance matrix (which can be created via multiple methods), it performs a Gower Principal Coordinate Analysis (PCoA) using the function ‘pcoa’ from the ‘ape’ package.

My data is DArT genlight SNP data and I have been using the following code:

pc <- gl.pcoa(gl)

gl.pcoa.plot(pc, gl)

Therefore I would think that a PCA is produced. My questions are (1) am I indeed performing a PCA, and (2) is this appropriate or should I instead be performing a PCoA?

Thanks in advance!

Kind regards,

Mandi

Arthur Georges

Nov 9, 2021, 10:33:07 PM

to dartR

Hi Mandi,

The literature is confusing on the distinction between PCA and PCoA (and for that matter MDS).

Basically, PCA takes the original data, for which it is sensible to compute a covariance matrix, and uses that matrix as the basis of the ordination (selecting a new basis in which the axes are ranked in order of the variation "explained").
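
To make that concrete, here is a minimal sketch in base R (not dartR code). The genotype matrix X is made-up toy data, and the variable names are only for illustration. It eigen-decomposes the covariance matrix by hand and checks that prcomp() reports the same ranked variances.

```r
# Toy data (hypothetical): 6 individuals x 3 loci, genotypes coded 0/1/2
X <- matrix(c(0, 1, 2,
              1, 1, 0,
              2, 0, 1,
              0, 2, 2,
              1, 0, 0,
              2, 2, 1), nrow = 6, byrow = TRUE)

# PCA "by hand": eigen-decompose the covariance matrix of the loci
eig <- eigen(cov(X), symmetric = TRUE)

# eigen() returns eigenvalues in decreasing order, so the axes are already
# ranked by the amount of variation each one "explains"
prop_explained <- eig$values / sum(eig$values)

# Scores: centre the data, then project it onto the eigenvectors (the new basis)
scores <- scale(X, center = TRUE, scale = FALSE) %*% eig$vectors

# prcomp() reports the same variances, stored as standard deviations
pca <- prcomp(X)
print(round(prop_explained, 3))
print(all.equal(eig$values, pca$sdev^2))
```

The variance of each column of scores equals the corresponding eigenvalue, which is what "variation explained" refers to.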

PCoA relies on a trick: the analysis does not need access to the raw data except to calculate the covariance matrix, so it replaces the covariance matrix with any distance matrix (after some transformation) to generate the ordination. For the distances to be accurately represented in the ordinated space, the distance matrix should be Euclidean (else negative eigenvalues are generated), or at the very least a metric distance, but for many purposes these conditions are not necessary. The resultant distortion in the ordination can be slight, and certainly less than the distortion that occurs by considering only the top 2 or 3 axes in the ordination.
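
The negative-eigenvalue point can be seen with a small sketch in base R, using cmdscale() (classical PCoA) rather than dartR or ape. The helper make_dist() and the three-object distances are hypothetical, chosen so that one set can be laid out on a straight line and the other violates the triangle inequality.

```r
# Hypothetical helper: build a "dist" object for three objects A, B, C
# from the pairwise values d(A,B), d(A,C), d(B,C)
make_dist <- function(ab, ac, bc) {
  m <- matrix(c(0,  ab, ac,
                ab, 0,  bc,
                ac, bc, 0), nrow = 3,
              dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
  as.dist(m)
}

# Euclidean case: A, B, C sit at positions 0, 1, 2 on a line
d_euclid <- make_dist(ab = 1, ac = 2, bc = 1)

# Non-metric case: d(A,C) = 3 exceeds d(A,B) + d(B,C) = 2
d_broken <- make_dist(ab = 1, ac = 3, bc = 1)

# cmdscale(..., eig = TRUE) returns all eigenvalues of the
# double-centred matrix used by classical PCoA
eig_euclid <- cmdscale(d_euclid, k = 1, eig = TRUE)$eig
eig_broken <- cmdscale(d_broken, k = 1, eig = TRUE)$eig

print(signif(eig_euclid, 3))  # no eigenvalue meaningfully below zero
print(signif(eig_broken, 3))  # one eigenvalue is negative
```

The negative eigenvalue corresponds to variation that cannot be drawn in any real Euclidean space, which is the distortion mentioned above.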

glPca uses not the covariance matrix, but Euclidean distance, so depending on how you look at it, it is a PCA or it is a PCoA. Standardized Euclidean distance and the covariance matrix are pretty much the same thing.
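
A rough way to check this equivalence for yourself, sketched in base R with simulated genotypes (made-up data, and not a description of how glPca is implemented internally): run a PCA on the raw matrix with prcomp(), run a classical PCoA on the Euclidean distances with cmdscale(), and compare the first two axes.

```r
set.seed(1)

# Simulated genotypes (hypothetical): 10 individuals x 20 loci, coded 0/1/2
X <- matrix(sample(0:2, 200, replace = TRUE), nrow = 10)

# PCA on the raw data (centred, unscaled)
pca_scores <- prcomp(X)$x[, 1:2]

# Classical PCoA on the Euclidean distances between individuals
pcoa_coords <- cmdscale(dist(X), k = 2)

# The sign of each axis is arbitrary, so compare absolute values
max_diff <- max(abs(abs(pca_scores) - abs(pcoa_coords)))
print(max_diff < 1e-8)
```

The two ordinations place the individuals identically, apart from possible reflections of individual axes.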

With SNP data and gl.pcoa, you are essentially doing a PCA, not a PCoA, and the new version of dartR reflects this in the labelling of the axes. That is fine.

It is only if you wanted to represent genetic distances using some other measure of genetic dissimilarity that you would feed gl.pcoa a distance matrix directly. You might, for example, want to feed it the FSTs (non-metric), or a fixed difference matrix, or a matrix combining both the SNP genotypes and the SilicoDArT presence/absence data, or add in some interval or ordinal data. Then you would use PCoA based on a distance matrix.

I hope that helps. You are doing a PCA.

Arthur

Mandi D'Ombrain

Nov 10, 2021, 11:44:57 PM

to dartR

Hi Arthur,

Thank you so much for clearing this up for me, it is much appreciated. The literature is certainly confusing!