Clarification of how gl.dist.ind handles missing data

Nikki Vollmer

Sep 19, 2023, 5:45:36 PM

to dartR

Hi,

I have been reading the Technical Note on Genetic Distance and was hoping someone could help clarify something in it. I am trying to run a PCoA in dartR v2.0.4 on SNPs that were not produced by DArTseq (50K loci across 230 individuals).

After reading the section on pg 35, "Further Notes on Managing Missing Data", I tried some of the code on my data, which do include NAs. I ran gl.impute (using the frequency or HW method) and then gl.dist.ind (scaled Euclidean), and the subsequent PCoA showed odd clustering: some coastal individuals clustered with offshore individuals, which definitely shouldn't happen. But when I run gl.dist.ind without first using gl.impute to replace NAs, the PCoA clustering makes sense.

I also tried running gl.filter.allna and then gl.dist.ind and got clustering that made sense.

Admittedly, I am not sure why my clustering ends up so far from what is expected when I run both gl.impute and gl.dist.ind. Going back to the "Impact of Missing Values" section on pg 26, I now think that gl.dist.ind, when run on SNP data, automatically removes all loci with missing data, so there is no need to run gl.impute first. Is that right?

If so, when would you use gl.impute before running a PCoA? Or am I totally misunderstanding what is happening?

Happy to provide more info if needed.

Thanks for any help, it is much appreciated!

Nikki

Arthur Georges

Sep 19, 2023, 6:12:37 PM

to da...@googlegroups.com

Hi Nikki

Run

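gl.set.verbosity(3)                            # verbose output from each step
gl <- gl.compliance.check(gl)                  # confirm the genlight object conforms to dartR
gl <- gl.filter.allna(gl)                      # drop loci or individuals scored entirely as NA
gl <- gl.filter.callrate(gl, threshold=0.95)   # filter on call rate to increase data density
gl <- gl.impute(gl)                            # impute the remaining NAs
pca <- gl.pcoa(gl)                             # PCA on the genlight object
gl.pcoa.plot(pca,gl)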

See if you still have the problem and then get back to us.

Refer to the preprint at https://doi.org/10.1101/2023.03.22.533737 which will replace the tech note on acceptance.

A

--
You received this message because you are subscribed to the Google Groups "dartR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dartr+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dartr/60f87c88-efda-4306-8ba6-603541ce662cn%40googlegroups.com.

Nikki Vollmer

Sep 20, 2023, 6:30:04 PM

to dartR

Thanks so much Arthur!

The code you provided passed the compliance check (although monomorphic loci are present) and each line successfully ran with no errors, except when I tried to plot the product of gl.filter.allna. The gl.filter.allna step took more than an hour to run and when doing the PCA produced the following message:

> pca <- gl.pcoa(dataNA)

Starting gl.pcoa
Processing genlight object with SNP data
Performing a PCA, individuals as entities, loci as attributes, SNP genotype as state
Completed: gl.pcoa
Warning message:
In UseMethod("depth") :
no applicable method for 'depth' applied to an object of class "NULL"

However, the (very similar) plots produced from all 3 filtering methods did cluster as expected.

What I was previously doing was the following:
D <- gl.dist.ind(gl, method="Euclidean", scale=TRUE, plot.out=TRUE)
pc <- gl.pcoa(D)
PCoA <- gl.pcoa.plot(pc, gl, scale=FALSE, ellipse = FALSE, pop.labels = 'pop', interactive=TRUE)

Am I correct that the code you provided was replacing NAs (in 3 different ways) and then doing a PCA?

And the code with gl.dist.ind is creating a distance matrix using Euclidean distances and performing a PCoA based on that distance matrix?

Thanks again!!

Nikki

peter...@unmack.net

Sep 20, 2023, 8:36:16 PM

to da...@googlegroups.com

G'day Nikki

The reason the pcoa takes so long is the number of loci you have. Once you get over around 10,000 loci the run time gets much longer. In most cases 500 high quality SNPs will give you the same story as 50,000 loci. That depends a lot on your question of course, but when looking for spatial patterns the number of loci isn't very important. When doing data exploration I always aim to have no more than around 5000 loci to save time.
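
As a rough illustration of thinning loci for exploratory runs, the base R sketch below draws a random subset of loci from a plain individuals-by-loci matrix. The matrix `geno`, its dimensions, and the target of 500 loci are made up for the example, and no dartR function is used, so treat it as a sketch of the idea rather than the recommended dartR workflow.

```r
# Hypothetical toy data: 20 individuals x 2000 loci, genotypes coded 0/1/2.
set.seed(1)  # only so the random draw is repeatable
geno <- matrix(sample(0:2, 20 * 2000, replace = TRUE), nrow = 20,
               dimnames = list(paste0("ind", 1:20), paste0("loc", 1:2000)))

n_keep <- 500                              # assumed target for exploration
keep <- sort(sample(ncol(geno), n_keep))   # random loci, original order kept
geno_sub <- geno[, keep, drop = FALSE]

dim(geno_sub)
```

Sorting the sampled indices keeps the retained loci in their original order, which makes the subset easier to compare against the full data set later.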

Cheers
Peter

Nikki Vollmer

Sep 22, 2023, 5:00:24 PM

to dartR

Thank you Peter!

So I did figure out why the general clustering patterns were inconsistent: I wasn't using the input file I thought I was (insert a slapping forehead emoji), and while I was telling the program I had 237 individuals, my input file actually had 242. Once I fixed that issue it all makes sense, and the general clustering is the way it should be across methods.
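
A mismatch like this can be caught early with a quick consistency check. The base R sketch below compares the individuals in a genotype matrix against a metadata table; the objects `geno` and `meta`, the column name `id`, and the sample names are all hypothetical and stand in for whatever inputs are actually being read.

```r
# Hypothetical inputs: a genotype matrix and a metadata table that should
# describe exactly the same individuals.
geno <- matrix(0, nrow = 5, ncol = 3,
               dimnames = list(c("i1", "i2", "i3", "i4", "i5"), NULL))
meta <- data.frame(id  = c("i1", "i2", "i3", "i4"),
                   pop = c("coastal", "coastal", "offshore", "offshore"),
                   stringsAsFactors = FALSE)

# Individuals present in one input but not the other
only_in_geno <- setdiff(rownames(geno), meta$id)
only_in_meta <- setdiff(meta$id, rownames(geno))

if (length(only_in_geno) > 0 || length(only_in_meta) > 0) {
  message("Genotypes: ", nrow(geno), " individuals; metadata: ", nrow(meta))
  message("Only in genotypes: ", paste(only_in_geno, collapse = ", "))
  message("Only in metadata: ", paste(only_in_meta, collapse = ", "))
}
```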

But I have to say trying all the different things suggested by Arthur helped me to mine the results and better understand what should be happening so I could narrow down where the error was occurring.

Thanks again for your time and thoughts!

Nikki :)

Arthur Georges

Sep 22, 2023, 9:40:50 PM

to da...@googlegroups.com

Hi Nikki -- glad you are sorting it out. I had a look at the gl.filter.allna script and have improved its efficiency so next time you should not have to wait an hour for it to process. This will become available in the next release.

I see that you initially took the approach of constructing a distance matrix and then passing this to gl.pcoa. In other words, you did a PCoA. The script I gave you did a PCA. When using Euclidean Distance for the PCoA and (implicitly) the covariance matrix for PCA, the outcomes are essentially equivalent. The only difference will be in the % explained values on the axes, for reasons explained in the preprint I recommended you read. So all good on that front.
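
The equivalence of the two routes can be checked on a small complete data set. The sketch below uses only base R (`prcomp` for the PCA, `dist` plus `cmdscale` for the PCoA) on a made-up genotype matrix with no missing values; it illustrates the general result and is not a reproduction of what gl.pcoa does internally.

```r
# Hypothetical complete genotype matrix (no NAs): 6 individuals x 5 loci.
geno <- rbind(
  c(0, 0, 0, 1, 2),
  c(0, 1, 0, 0, 2),
  c(0, 0, 1, 0, 1),
  c(2, 2, 1, 2, 0),
  c(2, 1, 2, 2, 0),
  c(1, 2, 2, 2, 1)
)
rownames(geno) <- paste0("ind", 1:6)

# PCA on the centred, unscaled data (covariance-based)
pca <- prcomp(geno, center = TRUE, scale. = FALSE)

# PCoA (classical scaling) on Euclidean distances between individuals
pcoa <- cmdscale(dist(geno, method = "euclidean"), k = 2)

# Axis 1 coordinates agree up to an arbitrary sign flip of the axis
max_diff <- max(abs(abs(pca$x[, 1]) - abs(pcoa[, 1])))
max_diff
```

The comparison uses absolute values because the sign of an ordination axis is arbitrary and can differ between the two functions.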

The script I recommended had a couple of elements in it. The first was to filter fairly heavily on call rate to increase the data density. This reduces the level of imputation required to generate a fully populated data matrix required by classical PCA. I recommend using nearest neighbour for the imputation (the default).
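
To make the call-rate step concrete, here is a base R sketch on a toy matrix: it computes the per-locus call rate, drops sparse loci, and compares how much of the matrix would still need imputing. The matrix and the 0.75 threshold are invented for a four-individual example (the script above uses 0.95), and this is not the gl.filter.callrate implementation.

```r
# Hypothetical toy matrix: 4 individuals x 5 loci, NA = missing call.
geno <- rbind(
  c(0, 1,  NA, 2, NA),
  c(0, NA, NA, 2, 1),
  c(1, 1,  NA, 0, 1),
  c(2, 1,  0,  0, 1)
)
colnames(geno) <- paste0("loc", 1:5)

# Call rate per locus = proportion of individuals with a non-missing call
call_rate <- colMeans(!is.na(geno))

threshold <- 0.75   # illustrative value for this tiny example only
dense <- geno[, call_rate >= threshold, drop = FALSE]

# Fraction of cells that would need imputing, before and after filtering
missing_before <- mean(is.na(geno))
missing_after  <- mean(is.na(dense))
c(before = missing_before, after = missing_after)
```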

PCA requires a data matrix with no missing values at all, hence the need for imputation of some sort. adegenet::glPCA forms the kernel of gl.pcoa, but it imputes by substituting the global means, and so can cause distortions (individuals with large numbers of missing values will be drawn toward the origin, which can be very misleading).
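
The pull toward the origin can be seen in a deliberately extreme toy example. The base R sketch below builds two well-separated groups, adds an individual `X` that belongs to group A but is missing 8 of 10 loci, fills the gaps with per-locus means, and runs a PCA with `prcomp`. All data are invented, and the imputation loop is a simple stand-in for global-mean substitution, not adegenet's or dartR's code.

```r
# Toy genotypes (0/1/2), 9 individuals x 10 loci. Groups A and B are fixed
# for different alleles; X is a group A individual missing 8 of 10 loci.
geno <- rbind(
  matrix(0, nrow = 4, ncol = 10),   # group A
  matrix(2, nrow = 4, ncol = 10),   # group B
  c(rep(NA, 8), 0, 0)               # X
)
rownames(geno) <- c(paste0("A", 1:4), paste0("B", 1:4), "X")

# Global-mean imputation: replace each NA with the mean of its locus
loc_means <- colMeans(geno, na.rm = TRUE)
imputed <- geno
for (j in seq_len(ncol(imputed))) {
  imputed[is.na(imputed[, j]), j] <- loc_means[j]
}

pca <- prcomp(imputed, center = TRUE, scale. = FALSE)
pc1 <- pca$x[, 1]

# X sits much closer to 0 on PC1 than its true group does
abs(pc1[c("A1", "X", "B1")])
```

On PC1, `X` lands between the two groups rather than with group A, purely because most of its genotypes were replaced by locus means.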

PCoA is more tolerant of missing values because the distances in the input distance matrix are calculated pairwise. But missing values are still problematic for a number of reasons (e.g. an NA is dropped from the distance calculation, so that locus contributes zero to the distance), so it is best to impute for PCoA as well if you plan to go down that route.
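
A small worked case shows what dropping an NA does to a pairwise distance. The base R sketch below uses two invented individuals and computes the Euclidean distance over shared loci only, then a rescaled version for comparison. It is an illustration of the arithmetic and makes no claim about how gl.dist.ind treats missing values internally.

```r
# Two hypothetical individuals scored at 5 loci (0/1/2), each missing one call.
a <- c(0, 1, 2, NA, 2)
b <- c(2, 1, 0, 2, NA)

shared <- !is.na(a) & !is.na(b)   # loci scored in both individuals

# Dropping NA loci: loci 4 and 5 add nothing to the sum of squares
d_dropped <- sqrt(sum((a[shared] - b[shared])^2))

# One possible correction: scale the sum up to the full number of loci.
# Base R's dist() applies this scaling for Euclidean distances with NAs.
d_rescaled <- sqrt(sum((a[shared] - b[shared])^2) * length(a) / sum(shared))

c(loci_used = sum(shared), dropped = d_dropped, rescaled = d_rescaled)
```

Because different pairs of individuals can share different numbers of loci, the unscaled distances are not strictly comparable across pairs, which is one reason to impute before building the distance matrix.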

Let us know if you have any further issues.

Arthur
