gl.fixed.diff compared to Fst, admixture and PCoA analysis

245 views
Skip to first unread message

Chiara Duijser

unread,
May 10, 2023, 2:01:49 AM5/10/23
to dartR

Hi everyone,


I have a question on how to interpret my data when investigating the presence of cryptic taxa or (sub)species in my dataset. I have collected samples from 4 different habitats and based on morphology they should all be the same species. However, I want to see if there are cryptic species or reproductively isolated populations present due to environmental differences between sites (all 4 sampling sites are within 500m apart).


My PCoA shows some clusters and most of my pairwise Fst values between habitats are >0.25 with the highest 0.746. I then ran an admixture analysis using the snmf function in the LEA package, which indicates the presence of 3 clusters in my dataset. To add to this, I looked at fixed differences between the populations using the Technical Note on the dartR website. I have 6 individuals per site (and 5 for the inner reef since one sample failed dartseq). In my case I grouped inner mangrove and outer mangrove together and inner reef and outer reef since fixed differences were 0 and to enhance sample size. After this, there are only two fixed differences between mangrove (n=12) and reef (n=11). I tested for significance and the distinction between these two OTUs did not significantly exceed the false positive rate (p = 1). Hence, there is insufficient evidence to reject the null hypothesis of the two populations belonging to the one species, i.e. there is insufficient evidence to assume cryptic taxa or subspecies.


So, this analysis suggests that despite the presence of private alleles and some fixed differences, I cannot conclude the presence of cryptic taxa or subspecies in my populations (potentially due to the small sample size?). I’m not entirely sure how to interpret this result as the Fst values for example indicate quite some genetic differentiation between some of my populations and admixture reveals 3 clusters. Would anyone have any suggestions or possibilities why these results don't completely seem to match up, e.g. is it perhaps due to my sample size?


Many thanks in advance!


Kind regards,

Chiara

Renee Catullo

unread,
May 10, 2023, 5:11:16 AM5/10/23
to da...@googlegroups.com
Hi Chiara,

Fst results are about the level of genetic variance in a subpopulation relative to the total genetic variance. If you think about this as a maths equation, if you either increase  the variance in a subpopulation or decrease the total genetic variance, you can inflate your Fst value.

I would take a step back and think about your filtering; did you do either of these things with the filters you used? Were they appropriate to the theory behind the statistic? The reality is that one set of filtering is not appropriate for all statistics. My favourite example is when people retain one SNP per locus when calculating heterozygosity. If you think about it, it makes little sense as heterozygosity is not inherited along a chromosome. My next favourite is when people remove singletons when calculating Tajima’s D - this statistic is based on looking for a deficit of singletons to infer expansion, so particularly an egregious error if it’s a threatened species.

For example high minor allele filtering is likely to increase your Fst as it will decrease both variables, but not necessarily equally. Too high a read depth filter tends to remove heterozygous sites at a greater proportion than homozygous, so you end up with more homozygous SNPs while reducing SNPs that show admixture and diversity.

I’ve definitely learned that pop gen has a significant element of artistry! Have a think about what you did to the data before you put it into the FST calculation, and if that could be the cause.

Cheers,

Renee





--
You received this message because you are subscribed to the Google Groups "dartR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dartr+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dartr/5ae18ccd-0d7a-4466-a40a-c1eecfb9920dn%40googlegroups.com.

Arthur Georges

unread,
May 10, 2023, 7:39:31 AM5/10/23
to da...@googlegroups.com
Hi Chiara,

Seems like you have undertaken the analysis correctly. You have 4 sampling sites 500 m apart in what I assume is a sessile species.

You would expect if those sampling sites were relatively isolated to see differences in allelic profiles because of drift, but also because of sampling error in estimating those allelic profiles with the low sample sizes. This would be reflected in the PCoA as some level of structure.

Are they biological species? If they were, you would expect to see the accumulation of fixed allelic differences between them despite their proximity and possibility of low levels of gene flow. You have some small numbers of fixed differences (that is, one corroborated difference), but owing to the low sample sizes this could be sampling error. You tested this against the false positive rate, and the difference was not sustained as significant.  So you have no evidence to reject the null proposition of your samples belonging to the same species.

Note that the absence of fixed differences cannot be overturned by increasing the sample size. Small sample sizes can throw spurious fixed differences, but if all loci share alleles at all loci, no amount of adding additional individuals will overturn that.

All good on that front.

Subspecies -- well that is a different matter. It depends on how they are defined. But the above argument based on fixed differences applies to substantive lineages on independent evolutionary trajectories. No evidence of that either in the fixed difference analysis.

Fst -- that is a measure of departure from a null proposition of departure from Hardy Weinberg equilibrium arising from structure among your sampling sites. This will potentially arise from allele frequency differences between sampling sites, not necessarily "irreversible" fixed allelic differences. So could be indicative of interesting structure among sampling sites, but not necessarily because of cryptic species.

Hope this helps.

Arthur





Bernd.Gruber

unread,
May 10, 2023, 7:46:01 AM5/10/23
to da...@googlegroups.com
Also would be interesting how common (what are the allele frequencies) in the private alleles in the population that has them. As if they are rare then the fact that they are private could be a sample size effect, if they are common then they might be considered real.

I completely agree with Renee it is necessary to think hard about filters before you do an analysis especially at low sample sizes.

Cheers, Bernd

---------


On 10 May 2023, at 16:02, Chiara Duijser <cmdu...@gmail.com> wrote:


--

Chiara Duijser

unread,
May 11, 2023, 2:43:28 AM5/11/23
to dartR
Hi Renee, Arthur and Bernd,

Thank you all for your reactions. This is very helpful since I’m new to analysing this kind of data! I really appreciate it.

I did sample the same coral species (based on visual ID; but different coral species tend to sometimes have very similar morphology) across the 4 habitats so no fixed differences would not surprise me. What I want to find out is whether there might be cryptic species or reproductively isolated populations present in the four different sample sites. I’m interested in this since my habitats have very distinct environmental conditions even though they are located close to each other. 

I performed all analyses on my filtered data, i.e., the calculated Fst values, admixture analysis, AMOVA, etc. For all I used the same filtering parameters in this order:

1. gl5 <- gl.filter.secondaries(gl5)
2. gl5 <- gl.filter.reproducibility(gl5, t=0.995) #checked for 0.85-0.995 and there are no major differences, hence I chose 0.995
3. gl5 <- gl.filter.callrate(gl5, method="loc", threshold=0.95)
4. gl5 <- gl.filter.callrate(gl5, method="ind", threshold=0.80) #nothing extra gets filtered out
5. gl5 <- gl.filter.rdepth(gl5, lower=5)
6. gl5 <- gl.filter.maf(gl5, threshold=0.05)
7. gl5 <- gl.filter.monomorphs(gl5)
8. gl5 <- gl.impute(gl5, method="neighbour")

I read that missing data can destroy the metricity of a distance matrix even though the distance measure upon which it is based is a metric distance. Out of the 3 strategies to deal with this:
1. Filter stringently on Call Rate
2. Remove loci with missing values; although this is very wasteful of data
3. Impute missing values based on observed allele frequencies within populations (method="frequency") or you can use method="neighbour" where you substitute missing values for the focal individual with the values taken from the nearest neighbour.
I used the impute function using nearest neighbour method. However, with the impute function do I get the message/warning that some residual missing values were filled randomly drawing from the global allele profiles by locus, since there were not more individuals to impute. I’m not sure if this impacts my analysis?

I’ve looked at the impact of read depth (with lower=5 and without filtering on read depth) on my admixture plots and there are no major changes in the admixture plots. And I read a paper by Linck and Battey (2019) on the impact of minor allele frequency threshold on population structure, thanks for bringing this to my attention @Renee, I wasn’t aware of the impact of filtering on these metrics! Based on reading this, I think both my filtering threshold on MAF and read depth is not too high and therefore not really causing these differences. But please correct me if I’m wrong!

@Bernd, would you suggest using the gl.report.pa() and looking at the ADF to investigate the allele frequencies of the private alleles in the populations?

Thank you,
Chiara
Op woensdag 10 mei 2023 om 21:46:01 UTC+10 schreef Bernd. Gruber:

Bernd.Gruber

unread,
May 11, 2023, 7:47:46 PM5/11/23
to da...@googlegroups.com

Hi Chiara,

 

For the private alleles study I would not use the impute step (as it may simply introduce some private/[fixed=unlikely] allele to your data set. Here I would argue to leave the NAs (there are not many in your data set after your filter for callrate I assume anyway and check per hand how many NAs there are in the private allele loci afterwards).

 

To identify the private alleles, there would be some additional code needed as currently gl.report.pa does only give a summary stastistics. It should not be too difficult to do so (you may want to look at the gl.report.pa function.

 

Let me know if you struggle I wanted to have this implemented (to get the actual loci) for a while but did not come around and that might be a good reason to do so.

 

Cheers, Bernd

 

 

 

==============================================================================

Dr Bernd Gruber                                              )/_         

                                                         _.--..---"-,--c_    

Professor Ecological Modelling                      \|..'           ._O__)_     

Tel: (02) 6206 3804                         ,=.    _.+   _ \..--( /          

Fax: (02) 6201 2328                           \\.-''_.-' \ (     \_          

Institute for Applied Ecology                  `'''       `\__   /\          

Faculty of Science and Technology                          ')                

University of Canberra   ACT 2601 AUSTRALIA

Email: bernd....@canberra.edu.au

WWW: bernd-gruber

 

Australian Government Higher Education Provider Number CRICOS #00212K 

NOTICE & DISCLAIMER: This email and any files transmitted with it may contain
confidential or copyright material and are for the attention of the addressee
only. If you have received this email in error please notify us by email
reply and delete it from your system. The University of Canberra accepts
no liability for any damage caused by any virus transmitted by this email.

==============================================================================

Jose Luis Mijangos

unread,
May 11, 2023, 8:02:25 PM5/11/23
to dartR
Hi Bernd and Chiara,

The implementation of getting the name of the loci with private alleles and fixed differences is already there in the function gl.report.pa. You can use the parameter loc_names = TRUE, loci names with private alleles and fixed differences are reported in a list in addition to the dataframe. 

Also, the function also reports an estimation of the lower bound of the number of undetected private alleles using the Good-Turing frequency formula,  originally developed for cryptography, which estimates in an ecological context the true frequencies of rare species in a single assemblage based on
an incomplete sample of individuals. The approach is described in Chao et al. (2017). For this function, the equation 2c is used. This estimate is reported in the output table as Chao1 and Chao2.

Cheers,
Luis 

Bernd.Gruber

unread,
May 11, 2023, 8:07:09 PM5/11/23
to da...@googlegroups.com

Thanks Luis. I remembered we talked about it, but I was not aware that you have already done it.

 

@Chiara: So then you simple can use the gl.keep.loc with the locnames returned by the function.

 

Cheers, Bernd

 

 

 

 

 

==============================================================================

Dr Bernd Gruber                                              )/_         

                                                         _.--..---"-,--c_    

Professor Ecological Modelling                      \|..'           ._O__)_     

Tel: (02) 6206 3804                         ,=.    _.+   _ \..--( /          

Fax: (02) 6201 2328                           \\.-''_.-' \ (     \_          

Institute for Applied Ecology                  `'''       `\__   /\          

Faculty of Science and Technology                          ')                

University of Canberra   ACT 2601 AUSTRALIA

Email: bernd....@canberra.edu.au

WWW: bernd-gruber

 

Australian Government Higher Education Provider Number CRICOS #00212K 

NOTICE & DISCLAIMER: This email and any files transmitted with it may contain
confidential or copyright material and are for the attention of the addressee
only. If you have received this email in error please notify us by email
reply and delete it from your system. The University of Canberra accepts
no liability for any damage caused by any virus transmitted by this email.

==============================================================================

 

From: da...@googlegroups.com <da...@googlegroups.com> On Behalf Of Jose Luis Mijangos
Sent: Friday, 12 May 2023 10:02
To: dartR <da...@googlegroups.com>
Subject: Re: [dartR] gl.fixed.diff compared to Fst, admixture and PCoA analysis

 

Hi Bernd and Chiara,

Chiara Duijser

unread,
May 13, 2023, 6:05:35 AM5/13/23
to dartR
Thank you Bernd and Luis for the extra information!
I used loc_names=TRUE and can indeed see the long list of private and fixed loci in my dataset. This is helpful!

Kind regards,
Chiara

Op vrijdag 12 mei 2023 om 10:07:09 UTC+10 schreef Bernd. Gruber:
Reply all
Reply to author
Forward
0 new messages