Choosing n.pca and n.da in dapc()

Konrad Taube

Apr 13, 2021, 12:28:31 AM
to poppr
Hi all,

As the title says, I'm working with the dapc() function in RStudio and am struggling a little bit to understand the n.pca and n.da arguments. I've read a few guides online, and honestly the numbers the authors use each time really confuse the heck out of me (which is my fault, not theirs!). I guess I have one simple question: how in the world do you find the best values for n.pca and n.da?

In my glPca() call, I used an nf of 5 for my samples (I have SNP data for 55 individuals across 5 countries), and am wondering if that feeds into those two arguments? It's probably obvious (lol it for sure is), but I'm pretty new to this stuff and I am trying!

Thanks so much,

Zhian Kamvar

Apr 26, 2021, 7:32:25 PM
to poppr
Hello,

The confusion is COMPLETELY understandable (especially for documentation that was written before I was in grad school). The values of n.pca and n.da MUST be obtained through exploratory analysis (e.g. leave the n.pca and n.da arguments blank the first time). The key is that you are creating a model of your data and you want to balance the signal and the noise.
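For example, that first exploratory pass could look something like this (`my_data` is just a stand-in for your own genind/genlight object, and the numbers at the end are placeholders, not recommendations):

```r
library(adegenet)

# First pass: leave n.pca and n.da unset so dapc() shows the PCA scree
# plot and the DA eigenvalue barplot and asks you to choose interactively.
# `my_data` is a stand-in for your own data object.
dapc1 <- dapc(my_data, pop(my_data))

# Once you have settled on values, record them explicitly so the run is
# reproducible and non-interactive:
dapc1 <- dapc(my_data, pop(my_data), n.pca = 10, n.da = 3)
```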

The number of PCs is determined by the number of alleles in a data set: the more alleles, the more PCs you have to choose from. If you choose all of them, your discriminant analysis will perfectly fit the posterior group assignments to your data, which is a bit like drawing a line that connects every dot in a linear regression. To help choose the right number, you can use cross-validation: https://grunwaldlab.github.io/Population_Genetics_in_R/DAPC.html#cross-validation-dapc-analysis-of-phytophthora-ramorum-from-forests-and-nurseries
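As a rough sketch, the cross-validation step looks something like the following (again `my_data` is a stand-in for your own genind object, as in the linked tutorial, and the limits are yours to tune):

```r
library(adegenet)

# xvalDapc() takes a table of allele data plus the group factor. It fits
# DAPC over a range of retained PCs and reports which number of PCs gives
# the highest successful reassignment / lowest RMSE.
# `my_data` is a stand-in for your own genind object.
set.seed(999)
xval <- xvalDapc(tab(my_data, NA.method = "mean"), pop(my_data),
                 n.pca.max = 50, n.rep = 30)

# The suggested number of PCs to retain:
xval$`Number of PCs Achieving Lowest MSE`
```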

The justification for choosing the number of discriminant axes (n.da) is similar: you want a model that represents your grouping data, but you don't want it to overfit. This is where the DA eigenvalue graph comes in (during exploratory analysis). The higher the bars on the graph, the more variance those axes explain; the lower bars often represent more noise than anything. This becomes important when you want to create one of those structure-like plots with compoplot(), because that uses all the information from the DA to create the graph. If you over-fit, you are bundling the noise into your result.
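To make that concrete, a sketch of the final step (placeholder object names and numbers, not prescriptions):

```r
library(adegenet)

# Refit with the values you settled on during exploration (10 PCs and
# 3 DAs here are placeholders, not recommendations), then draw the
# structure-like membership plot. compoplot() uses the posterior group
# membership probabilities, so an over-fitted model bakes noise into it.
dapc_final <- dapc(my_data, pop(my_data), n.pca = 10, n.da = 3)
compoplot(dapc_final, posi = "bottomleft", cleg = 0.6)
```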

I hope that helps.

All the best,
Zhian

Konrad Taube

May 1, 2021, 3:12:23 PM
to poppr
Hi Zhian,

Thanks so much for your reply. I have a couple of questions now! The first step in the guide you linked is to give xvalDapc() a go, but I'm unsure what the actual input should be. I see in the tutorial it is 'nancycats', but for real data what would you suggest?

I also tried a dapc() command with `NULL` for n.pca and n.da, and made the following scatter plot with this code:

scatter(bbel1.dapc, col = cols, cex = 2, legend = TRUE, clabel = FALSE,
        posi.leg = "bottomleft", scree.pca = TRUE, posi.pca = "topleft",
        cleg = 0.75)

[screenshot attached: Capture.PNG]

So would I call that the result of my "exploratory analysis"? It gave me PCA values and DA eigenvalues to look at, and I'd like to keep going with my analyses. I feel like I might be pretty close, but I'd love to get something like that cross-validation plot for my data to confirm it. Thanks so much again!!