Interpretation of a PCA


Molly Bloomfield

May 26, 2024, 11:48:31 PM
to dartR
Hi everyone,

I'm having some trouble with a PCA I produced from my DArTseq SNP data. 

Two of the clusters make sense to me, but the third is confusing. The cluster itself has an arch shape, and there are species in there that I wouldn't expect to be so closely related to the majority of the cluster. Should I be applying a transformation to the data to eliminate the arch shape, or is that a normal thing to see? As for the 'outgroups', is it just that their differences are explained in other principal components?

Thanks in advance to anyone who can help me!

Molly
PCA_gl_lepte_filtered_CR1_axis1and2.pdf

Stephen Opiyo

May 27, 2024, 8:21:58 PM
to da...@googlegroups.com
Hello Molly, 

Could you share a 3D PCA plot with us? It seems like there are four clusters in the data. The third cluster, which you described as an arc, appears to have two subclusters: one with positive PC1 and positive PC2 values, and the other with negative PC1 and positive PC2 values. A 3D PCA plot might clearly separate these four clusters since it will capture more variation in the data.

Kind regards,

Stephen 
--
You received this message because you are subscribed to the Google Groups "dartR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dartr+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dartr/b1d26024-cc0b-4495-80a4-7ad10cec805en%40googlegroups.com.

Stephen Opiyo

May 27, 2024, 8:22:10 PM
to da...@googlegroups.com
Hello Molly,

Thank you for sharing your results and asking questions. Your PCA separated the species into three distinct clusters, with PC1 and PC2 together accounting for only 35.1% of the total variation. I recommend plotting PC1, PC2, and PC3 in a 3D plot to better visualize how the species cluster.

Despite PC1 and PC2 explaining only a portion of the variation, the species form three clearly distinct clusters:

  1. Species in Cluster 1 are represented by positive PC1 and negative PC2 values.
  2. Species in Cluster 2 are represented by negative PC1 and negative PC2 values.
  3. Species in Cluster 3 are represented by positive PC2 values only.
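The sign rules in the list above can be written down as a small function. This is only a sketch in plain Python (not dartR code): the sample names and scores are hypothetical, and sign-based quadrants are a crude stand-in for real cluster assignment.

```python
def assign_cluster(pc1, pc2):
    """Label a sample using the PC1/PC2 sign rules listed above."""
    if pc2 > 0:
        return "Cluster 3"          # positive PC2, any PC1
    if pc2 < 0 and pc1 > 0:
        return "Cluster 1"          # positive PC1, negative PC2
    if pc2 < 0 and pc1 < 0:
        return "Cluster 2"          # negative PC1, negative PC2
    return "unassigned"             # a score of exactly zero fits no rule

# Hypothetical scores (sample -> (PC1, PC2)); not read from the attached plot.
scores = {
    "sample_a": (3.2, -1.5),
    "sample_b": (-2.8, -0.9),
    "sample_c": (1.1, 2.4),
    "sample_d": (-1.7, 3.0),
}

labels = {name: assign_cluster(pc1, pc2) for name, (pc1, pc2) in scores.items()}
for name, label in labels.items():
    print(name, label)
```

Note that sample_c and sample_d both land in Cluster 3 even though their PC1 signs differ, which is the kind of split within the arch that a third axis might resolve.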

Could you please clarify your question about the arc shape? I am not certain if transforming the data will affect your results, but it could be worth exploring further.

Kind regards,

Stephen 


Arthur Georges

May 27, 2024, 8:55:33 PM
to dartR
Hi Molly,

Transformations are typically used in ecological studies to remove the "horseshoe effect". This arises because population attributes at either end of an environmental gradient tend to be similar and so those ends are drawn together to form a horseshoe effect. It is hard to imagine how a similar process could apply to SNP data. So I agree with Stephen that transforming is unlikely to assist here.

A three-dimensional plot is a good idea, and will give you some additional insight. Indeed, you should examine the scree plot to see how many dimensions are informative and which are just noise. Then, among the informative axes, you need to decide how far to delve into deeper dimensions. A rule of thumb is to consider only axes that each explain more than 10% of the variation (hence the horizontal line in dartR's scree plot).
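The 10% rule of thumb comes down to simple arithmetic on the eigenvalues. Here is a minimal sketch in plain Python (not dartR code), assuming the list holds every eigenvalue of the ordination; the values are made up for illustration and the function names are hypothetical.

```python
# Hypothetical eigenvalues for all axes of a toy PCA (illustrative only).
eigenvalues = [42.0, 28.2, 12.5, 6.1, 4.0, 3.2, 2.5, 1.5]

def percent_explained(eigs):
    """Percentage of the total variance carried by each axis."""
    total = sum(eigs)
    return [100.0 * e / total for e in eigs]

def informative_axes(eigs, threshold=10.0):
    """1-based indices of axes that explain more than `threshold` percent."""
    pct = percent_explained(eigs)
    return [i for i, p in enumerate(pct, start=1) if p > threshold]

pct = percent_explained(eigenvalues)
for i, p in enumerate(pct, start=1):
    note = "above 10%" if p > 10.0 else "below 10%"
    print(f"PC{i}: {p:5.1f}%  ({note})")
print("Axes to consider:", informative_axes(eigenvalues))
```

With these toy values only the first three axes pass the threshold, so a PC1/PC2/PC3 view would cover everything the rule of thumb treats as informative.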

That said, if there are three groups as your data suggest, they can be represented at a gross level in just two dimensions, so looking deeper will reveal departures from those groupings but will probably not change the gross story depicted in the two dimensions.

Without the context it is hard to advise further. However, one possibility is that there are two "clines" emanating from a source (or a "cline" disrupted by a barrier), contemporary or historical. Gaps in sampling arising from sampling limitations, or because intervening populations no longer exist, could be an issue obscuring the underlying process. If there were three discrete groups, I would not be suggesting this, but the arch you identify makes it a possibility. Have a look at the PCA I have attached. It represents a cline that is interrupted by a partial barrier (dispersal constriction at Brisbane) which leads to divergence patterns heading in different directions (an arch).

Without context, I am not saying this is the explanation in your case, but it might be.

In any case, data talks and the pattern you see is crying out for a biological (popgen) explanation. Trying to make it go away with transformation is probably not the answer.

Good luck with it.

Arthur
Doc1.docx

Molly Bloomfield

May 27, 2024, 9:36:30 PM
to dartR
Hi all,

Thank you all very much for your suggestions and comments. I have looked at PC1 v PC3 and PC2 v PC3 as well; please find the plots attached.

I've also produced two phylogenies from the same data, and the populations falling into the positive PC1 and positive PC2 quadrant are mostly 'outgroups' for those with negative PC1 values, excepting Recherche (RECH), which goes with the other negative PC1 populations. Analysis with NewHybrids is suggesting that Recherche and Strahan populations may represent hybrids between my main species and another, closely related species, represented in the PCAs as GWT.

The group from NZ is also phylogenetically distinct in my IQ-TREE and SVDquartets, so that distinction makes sense (Nelson, Tararua, Ruahine, Whakapapa, Auckland, Canterbury and Chatham Islands). 

I'm unsure of a geographic barrier between Temma and Strahan, however. These are more geographically close than Strahan and Recherche, but Strahan and Recherche appear more closely related in all of my analyses. I will have to do more reading! Thanks again, this has been very helpful.

Kind regards,
Molly
PCA_gl_lepte_filtered_CR1_axis1and3.pdf
PCA_gl_lepte_filtered_CR1_axis2and3.pdf

Peter Unmack

May 27, 2024, 9:56:03 PM
to da...@googlegroups.com
Just to add to what Arthur said, often if you have a diverse dataset
then PCA on the whole dataset may not be overly informative. Things
like outgroups will often have small sample sizes and more missing data
and should likely be excluded from PCA. It's not really clear, though,
what the taxonomic breadth of your dataset is.
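Dropping outgroups, poorly genotyped individuals and very small populations before PCA is a filtering step. A rough sketch of that logic in plain Python follows; dartR has its own filtering functions (I believe gl.drop.pop() and gl.filter.callrate() cover this), which the sketch does not use. All names, thresholds and genotypes below are hypothetical.

```python
from collections import Counter

# Toy genotypes: 0/1/2 = copies of the alternate allele, None = missing call.
genotypes = {
    "ind1": [0, 1, 2, 0],
    "ind2": [0, None, 2, 1],
    "ind3": [None, None, None, 1],
    "ind4": [1, 1, 0, 2],
    "ind5": [2, 2, None, None],
}
population = {"ind1": "popA", "ind2": "popA", "ind3": "popB",
              "ind4": "popB", "ind5": "OUT"}

def missing_rate(calls):
    """Fraction of loci with no genotype call for one individual."""
    return sum(c is None for c in calls) / len(calls)

def select_for_pca(genotypes, population, exclude_pops=(),
                   min_pop_size=2, max_missing=0.5):
    """Keep individuals that survive the outgroup, missingness and size filters."""
    kept = {ind: calls for ind, calls in genotypes.items()
            if population[ind] not in exclude_pops
            and missing_rate(calls) <= max_missing}
    # Population sizes are counted after the first two filters have been applied.
    sizes = Counter(population[ind] for ind in kept)
    return {ind: calls for ind, calls in kept.items()
            if sizes[population[ind]] >= min_pop_size}

selected = select_for_pca(genotypes, population, exclude_pops=("OUT",))
print(sorted(selected))
```

In this toy case ind5 is dropped as an outgroup, ind3 for missing data, and ind4 because popB is then left with a single individual, which shows how the filters can cascade.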

I find it helpful to look at both phylogenetic trees and PCA plots. Often,
deeper splits in the trees are not well represented in PCA plots, but
divergences within lineages are often more informative and similar
between both methods.

NewHybrids will only detect mixing in the last 2-4 generations, not
older events. Note that if you do have higher levels of introgression
then those individuals should be excluded from phylogenetic analysis as
they will distort your results (and it violates a key assumption of the
phylogenetic method). It depends a bit on what taxonomic level you are
dealing with, though.

Cheers
Peter

molly.blo...@gmail.com

May 27, 2024, 10:06:44 PM
to da...@googlegroups.com
Thanks Peter. I ran the PCA with the outgroups excluded, and it worked much better!

Good to know about phylogenies and hybrids too. I'll rerun that with the putative hybrids excluded and see if the backbone has better support; I suspect it will.

Kindest,
Molly


Molly Bloomfield

May 28, 2024, 6:43:29 AM
to dartR
Hi all,

I'm also curious about the NewHybrids analysis - in the documentation for the gl.nhybrids() function, it states that "One might elect to repeat the analysis (method='random') and combine the resultant posterior probabilities should 200 loci be considered insufficient."
- How would you assess if 200 loci is sufficient? 
- How would you "combine" the resultant posterior probabilities? 

Currently I have run the analysis multiple times and averaged the PP matrices, but I'm not sure if this is the appropriate method. Would the frequency with which each sample is assigned to a category (PP > 0.8) work better? Or the median PP?
In the end, I will not be basing any inferences on these results, as I don't think the assumption holds that the hybridisation event occurred within the last two generations, but as it's for my master's thesis I would like to understand further.
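The three summaries mentioned above (mean PP, median PP, and frequency of PP > 0.8) can be computed side by side to see where they disagree. The sketch below is plain Python, not dartR code, and does not say which summary is the appropriate one. The posterior probabilities are hypothetical values for a single hybrid category, and using the spread across runs as an informal check on whether 200 loci suffice is a suggestion, not something stated in the gl.nhybrids() documentation.

```python
from statistics import mean, median

# Hypothetical posterior probabilities for ONE category from five repeat runs.
# Each run maps sample name -> PP; values are illustrative, not real output.
runs = [
    {"s1": 0.95, "s2": 0.10, "s3": 0.85},
    {"s1": 0.90, "s2": 0.05, "s3": 0.40},
    {"s1": 0.99, "s2": 0.20, "s3": 0.90},
    {"s1": 0.97, "s2": 0.15, "s3": 0.30},
    {"s1": 0.92, "s2": 0.10, "s3": 0.88},
]

def summarise(runs, threshold=0.8):
    """Per-sample mean, median, assignment frequency and range across runs."""
    summary = {}
    for sample in sorted(runs[0]):
        pps = [run[sample] for run in runs]
        summary[sample] = {
            "mean": mean(pps),
            "median": median(pps),
            "freq_above": sum(p > threshold for p in pps) / len(pps),
            "range": max(pps) - min(pps),
        }
    return summary

summary = summarise(runs)
for sample, stats in summary.items():
    print(sample, {k: round(v, 3) for k, v in stats.items()})
```

In this toy data s1 and s2 give the same answer under every summary, while s3 has a mean below 0.8, a median above it, and a wide range across runs. Samples like s3 are where the choice of summary matters and where more loci or more runs might be worth considering.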

Kindest,
Molly
