Species delineation with SNP data: pop column in metadata for dartR

Kate Moffatt

May 31, 2023, 9:05:02 PM

to dartR

Hi everyone,

I am hoping to use my SNP data for species delineation in a genus of mammals (6 species in the group, although 1 species may not be valid). I want to examine the relationships between the "core" group of closely related species and the outgroup (the most genetically distinct species), to determine the taxonomic status/validity of all species in the group.

In the metadata csv I need to create for dartR it mentions I need two special individual metrics: id (unique identifier for specimens) and pop (a label for the biological population from which the individual was drawn).

I am aware that often people use the pop column to identify a geographic population of a particular species, to understand population genetics.

If I am looking at species relationships within a genus, is it appropriate to use species names as the labels in the pop column? Or will this create errors in the analysis, meaning I need to label by geographic population instead?

Thanks,

Kate

Jose Luis Mijangos

Jun 6, 2023, 11:23:45 PM

to dartR

Hi Kate,

The decision to use the species names as population depends on the analysis and the hypothesis you want to test. You would need to read the analysis methods to understand how population assignments are used in each analysis.

For species delineation, you could use a fixed difference analysis. You can find our tutorial for this analysis here:

http://georges.biomatix.org/storage/app/media/uploaded-files/TechNote_fixed_difference_analysis_25-Feb-22.pdf

In this specific analysis, you would use the species names as populations.

To change the population names easily, you can use the dartR function "gl.edit.recode.pop".
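
As a minimal sketch of what species names in the pop column could look like, the base R snippet below builds an individual metadata table with the two required columns (id and pop) and writes it to a CSV. The specimen ids, species labels, site names and output path are all invented for illustration; sampling locality is kept as an extra column so it remains available if a geographic grouping is wanted later.

```r
# Hypothetical specimen table: ids, species labels and sites are invented.
specimens <- data.frame(
  id      = c("S001", "S002", "S003", "S004"),
  species = c("sp1", "sp2", "sp2", "sp6"),
  site    = c("siteA", "siteB", "siteC", "siteD"),
  stringsAsFactors = FALSE
)

# The metadata needs 'id' and 'pop'; here pop carries the species label,
# and the sampling locality is kept as an additional column.
ind_metadata <- data.frame(
  id   = specimens$id,
  pop  = specimens$species,
  site = specimens$site,
  stringsAsFactors = FALSE
)

# id must be unique for each specimen.
stopifnot(!anyDuplicated(ind_metadata$id))

# Written to a temporary location for this example only.
metadata_file <- file.path(tempdir(), "ind_metadata.csv")
write.csv(ind_metadata, metadata_file, row.names = FALSE)
```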

Cheers,

Luis

Kate Moffatt

Jun 7, 2023, 1:24:52 AM

to dartR

Hi Luis,

Thank you for this information. It was very helpful.

I am hoping to create an ML tree in IQ-TREE using my filtered SNP dataset to see where species and clades are positioned in the tree. Is it therefore suitable to use species names in the pop column? My understanding is that this should not influence their positioning in the ML tree, because the names only act as labels for the biological population (i.e. species), and that the tree will then indicate how well resolved the clades are. Is that right? Any advice would be greatly appreciated.

Thanks for the suggestion of a fixed difference analysis; this looks ideal. However, the tutorial mentions that sample sizes are recommended to be >10 and comprehensively sampled across the geographic range. Unfortunately, for my group of taxa, it was incredibly hard to get many samples for each species and to sample across their entire geographic range.

Instead, I have 6 species, with 8 to 19 individuals per species. Would it still be appropriate to conduct a fixed difference analysis even though sample sizes and geographic sampling are not ideal, with sometimes fewer than 10 individuals for a species?

Thanks,

Kate

Arthur Georges

Jun 8, 2023, 9:01:56 AM

to dartR

Hi Kate,

There are differing views on species delimitation arising in part because of differing views on what is being referred to as a species. You seem to have some pre-determined putative species, and provided you are sure they are good entities (discrete units on independent evolutionary trajectories) then a phylogeny can be most informative. Substantively divergent clades (lineages) might be called species.

Coming at it from a popgen perspective, the question becomes whether there is sufficient evidence to challenge the null proposition that your "species" are one and the same. Fixed allelic differences (in number exceeding the false positive rate) between two putative taxa in sympatry yield a serious challenge to that null proposition. Slam dunk.

In parapatry, where one may assume that there is some opportunity for exchange of individuals even at a low level, or episodically, fixed allelic differences greater in number than the false positive rate also challenge the null proposition that two putative taxa are from the same species. If not substantial in number, you might have two species, but not enough evidence to demonstrate it.

In allopatry, as always, the challenge defies a definitive answer. If the number of fixed differences does not exceed the false positive rate, then you have no evidence to challenge the null proposition. But if you do get substantial numbers of fixed differences, the decision is subjective -- are they different enough to be considered separate species, or just geographic isolates (lineages within species)?

So a fixed difference analysis can be useful.

You usually do not want to start from a preconception of the species you have at hand, but let the analysis generate the amalgamations of sample sites based on absence of fixed differences (in the knowledge that low sample sizes will not obscure fixed differences but rather inflate them through false positives) until you have a set of putative diagnosable units. If they match up with your six species all is well.

If you do not have enough samples to do that, then you can work with your putative species (combining localities) and do the fixed difference analysis. No fixed differences in excess of the false positive rate, no defensible evidence of two species -- null proposition cannot be rejected. Substantive fixed differences (depending on the spatial context) can be used in argument to support different species.
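
To make the idea of a fixed difference concrete, here is a minimal base R sketch that counts loci at which two populations share no alleles. It is an illustration only, not dartR's implementation: the genotype matrix, the labels spA and spB, and the helper count_fixed_diff are invented for the example, and it makes no attempt to estimate the false positive rate discussed above.

```r
# Genotypes coded as the count of the alternate allele (0, 1, 2); NA = missing.
# Rows are individuals, columns are loci. All values are invented.
geno <- matrix(
  c(0,  0, 2, 1,
    0,  0, 2, 0,
    0, NA, 2, 1,
    2,  0, 0, 1,
    2,  0, 0, 2,
    2,  1, 0, 1),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("loc1", "loc2", "loc3", "loc4"))
)
pop <- c("spA", "spA", "spA", "spB", "spB", "spB")

# Hypothetical helper: number of loci fixed for alternative alleles.
count_fixed_diff <- function(geno, pop, pop1, pop2) {
  # Alternate-allele frequency per locus within each population.
  p1 <- colMeans(geno[pop == pop1, , drop = FALSE], na.rm = TRUE) / 2
  p2 <- colMeans(geno[pop == pop2, , drop = FALSE], na.rm = TRUE) / 2
  # A locus is a fixed difference when one population has only the
  # reference allele and the other has only the alternate allele.
  fixed <- (p1 == 0 & p2 == 1) | (p1 == 1 & p2 == 0)
  sum(fixed, na.rm = TRUE)
}

n_fixed <- count_fixed_diff(geno, pop, "spA", "spB")
```

With only three individuals per population in this toy example, a locus can look fixed simply because a rare allele was not sampled, which is why counts should be judged against the expected number of false positives.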

If you do not have comprehensive sampling across the range of the putative species, beware also of the "separated demes on a poorly sampled cline" effect, or "arbitrary slices of continuous geographic clines" -- refer to doi: 10.1093/sysbio/syz042.

A couple of our papers that might give you some more background are:

http://georges.biomatix.org/storage/app/uploads/public/61c/192/52b/61c19252ba589941621011.pdf

http://georges.biomatix.org/storage/app/uploads/public/5ce/31d/464/5ce31d464fed3159371065.pdf

Really hope this helps and that I have not just confused you as much as I often confuse myself with this stuff.

Arthur

pop=species is fine.

Kate Moffatt

Jun 13, 2023, 12:41:15 AM

to dartR

Hi Arthur and All,

It appears I sent my response using “reply to author”, and I am not sure whether you received it. So I am writing here so that the group can see the conversation (and hopefully benefit from this query), and in case my email did not reach you.

Firstly, I would like to thank you for the detailed information and help. It is greatly appreciated and I found this information incredibly informative and helpful. I have read the attached papers (great research!) and that gave me a better understanding of fixed difference analysis along with the tutorial provided. Thank you.

I ran the fixed difference analysis with my 6 putative species and got some interesting results that I am hoping to confirm are likely correct. Two of the species (sp2 and sp5) had no fixed differences between them (0), which was to be expected, as it is likely that my research will sink one of these species (sp5) and synonymise it with its former species (sp2). So this was great news and an interesting result. The remaining fixed differences between each pair of populations (species) appear to match my other genetic, morphological and phylogenetic research, so that is good news too.

My concern with the analysis below is that all my p-values were returned as 0. This makes me question whether I have done the analysis correctly, or whether this is an error that I somehow need to fix. Could you please let me know whether a p-value of 0 is possible with this analysis or is unexpected? And if it is unexpected, where might I have gone wrong? Any help would be greatly appreciated.

Below is my code with the output…

D <- gl.fixed.diff(hwe95, verbose=4)

Comparing populations for absolute fixed differences

Monomorphic loci removed
Populations, aggregations and sample sizes
 Sp1  sp2  sp3  sp4  sp5  sp6
   8   19   13    8   10   15

D$fd

      Sp1  sp2  sp3  sp4  sp5
Sp2   840
Sp3   931  208
Sp4   858  161  262
Sp5   893    0  221  173
Sp6  1002  586  685  606  640

D2 <- gl.collapse(D, tpop=1, verbose=3)
Starting gl.collapse
Processing a fixed difference (fd) object with SNP data
Comparing populations for absolute fixed differences
Amalgamating populations with corrobrated fixed
differences, tpop = 1
Initial Populations
Sp1 sp2 sp3 sp4 sp5 sp6
New population groups
Group: sp2+
[1] "sp2" "sp5"

Sample sizes
 Sp1 sp2+  sp3  sp4  sp6
   8   29   13    8   15

D2$fd

      Sp1 sp2+  sp3  sp4
Sp2+  806
Sp3   931  191
Sp4   858  148  262
Sp6  1002  554  685  606

D3 <- gl.collapse(D2, tpop=1, verbose=3)
Starting gl.collapse
Processing a fixed difference (fd) object with SNP data
Comparing populations for absolute fixed differences
Amalgamating populations with corrobrated fixed
differences, tpop = 1
Initial Populations
Sp1 sp2+ sp3 sp4 sp6
New population groups

No further amalgamation of populations at fd <= 1
Analysis complete

D3$fd

      Sp1 sp2+  sp3  sp4
Sp2+  806
Sp3   931  191
Sp4   858  148  262
Sp6  1002  554  685  606

D4 <- gl.fixed.diff(D3, test=TRUE, alpha=0.05, verbose=3, reps=1000)

Starting gl.fixed.diff
Processing a fixed difference (fd) object with SNP data
Comparing populations for absolute fixed differences
Monomorphic loci removed
Populations, aggregations and sample sizes
 Sp1 sp2+  sp3  sp4  sp6
   8   29   13    8   15
Warning: Fixed differences can arise through sampling error if sample sizes are small
Some sample sizes are small (N < 10, minimum in dataset = 8 )
Comparing populations pairwise -- this may take time. Please be patient
Completed: gl.fixed.diff

D4$pval
      Sp1 sp2+  sp3  sp4  sp6
Sp1     0    0    0    0    0
Sp2+    0    0    0    0    0
Sp3     0    0    0    0    0
Sp4     0    0    0    0    0
Sp6     0    0    0    0    0