Hi Dallas,
PCA does not like missing data. The algorithm, as originally developed by Pearson, expects a complete dataset.
Algorithms treat missing data in different ways. As I understand it, a common approach is to replace the missing element with a central value (such as the locus mean), which, because PCA has no advance knowledge of group membership, can be problematic if the objective of the analysis is to identify groupings; in practice this only matters where the level of missing data is fairly extreme. An alternative with SNPs is to replace the missing data with a random selection from the possible SNP states. This throws in an additional element of noise, but it should not systematically distort the outcome. I am not sure what {adegenet} glPca does, and it is at the heart of our script; I expect it is something along the lines of the latter.
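To make that concrete, here is a minimal sketch in plain R of the two approaches as I understand them, using a made-up toy matrix (snps) of 0/1/2 scores with NA for missing calls. It is only an illustration, not what glPca actually does internally, and the random version is just one way of reading the "random selection" idea.

# Toy SNP matrix: individuals in rows, loci in columns, scores 0/1/2, NA = missing call
snps <- matrix(c(0, 1, 2, NA,
                 1, 1, NA, 0,
                 2, NA, 1, 1),
               nrow = 3, byrow = TRUE)

# Approach 1: replace each missing score with the column (locus) mean,
# i.e. a central value assigned with no knowledge of group membership
impute_mean <- function(x) {
  apply(x, 2, function(col) {
    col[is.na(col)] <- mean(col, na.rm = TRUE)
    col
  })
}

# Approach 2: replace each missing score with a random draw from the scores
# observed at that locus, which adds noise rather than pulling towards the centre
impute_random <- function(x) {
  apply(x, 2, function(col) {
    observed <- col[!is.na(col)]
    col[is.na(col)] <- observed[sample(length(observed), sum(is.na(col)), replace = TRUE)]
    col
  })
}

snps_mean   <- impute_mean(snps)
snps_random <- impute_random(snps)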
The distances computed for PCA are Euclidean, which means they can be represented in a rigid (Cartesian) space without distortion, provided the ordination retains the full set of dimensions. I believe missing data, depending on how they are handled, can destroy the Euclidean property and can therefore result in distortion and even negative eigenvalues. But again, the level of missing data would need to be extreme to cause this.
With PCoA, where the starting point is a distance matrix, it will depend on how the distance measure accommodates missing values. You do not get something for nothing, so a distance matrix calculated using Euclidean distance on data with missing values will likely no longer satisfy the Euclidean properties, if that is important.
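If you want to check whether a particular distance matrix still behaves as Euclidean, classical scaling reports the eigenvalues, and negative ones flag that the distances can no longer be embedded in a rigid Cartesian space. A rough sketch, reusing the toy snps matrix above and a hypothetical pairwise-deletion distance (only the loci scored in both individuals are compared, then rescaled); whether negatives actually appear will depend on the data and the amount of missingness:

# Pairwise Euclidean distance using only the loci scored in both individuals,
# rescaled to the full number of loci -- the sort of accommodation of missing
# values that can break the Euclidean property of the resulting matrix
pairwise_dist <- function(x) {
  n <- nrow(x)
  d <- matrix(0, n, n)
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      shared <- !is.na(x[i, ]) & !is.na(x[j, ])
      d[i, j] <- d[j, i] <-
        sqrt(sum((x[i, shared] - x[j, shared])^2) * ncol(x) / sum(shared))
    }
  }
  as.dist(d)
}

# Classical scaling (PCoA) reports the eigenvalues; negative values indicate
# the distances cannot be represented in a rigid Cartesian space without distortion
pcoa <- cmdscale(pairwise_dist(snps), k = 2, eig = TRUE)
pcoa$eig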
Missing values in the distance matrix itself should be avoided, as PCoA does not access the original SNP data and so would need to fill in the missing values using some central-value approach.
If I had a magic wand, I would take advantage of my prior knowledge of the populations that contributed to the dataset, assume HWE for each population (sampling site), and replace each missing value with a SNP score (0, 1 or 2) drawn in line with HWE expectation for that population. That will add random noise to the dataset which, if extreme for any one locus, will push that locus's contribution down to the lower dimensions of the ordination and so not influence the outcome in the one, two or three dimensions of the final visualisation.
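As a sketch of what I mean (assuming the same toy snps matrix and a hypothetical pop factor of sampling sites; this is not an existing dartR function), one could estimate the allele frequency at each locus within each population from the scored individuals and draw the missing genotypes from Binomial(2, p), which is the HWE expectation:

# For each population and locus, estimate the alternate allele frequency from
# the scored individuals and draw missing genotypes from Binomial(2, p),
# i.e. the HWE expectation for that sampling site
impute_hwe <- function(x, pop) {
  for (p in levels(pop)) {
    rows <- which(pop == p)
    for (j in seq_len(ncol(x))) {
      scores <- x[rows, j]
      miss <- is.na(scores)
      if (any(miss) && !all(miss)) {
        freq <- mean(scores, na.rm = TRUE) / 2   # 0/1/2 scores -> allele frequency
        x[rows[miss], j] <- rbinom(sum(miss), size = 2, prob = freq)
      }
      # if every individual at this site is missing at this locus, it is left as NA
    }
  }
  x
}

# Hypothetical sampling sites for the toy matrix above
pop <- factor(c("siteA", "siteA", "siteB"))
snps_hwe <- impute_hwe(snps, pop)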
Perhaps this is something we could add to version 2 of dartR after it is released shortly?
I should add that I am not an expert in this, and if someone with deeper mathematical knowledge would like to contribute, that would be great. Even if you contradict my interpretation, I won't be offended.
Hope this helps, Dallas.
Arthur