species delineation with SNP data. pop column in metadata for dartR


Kate Moffatt

May 31, 2023, 9:05:02 PM
to dartR
Hi everyone, 

I am hoping to use my SNP data for species delineation in a genus of mammals (6 species in the group, although 1 species may not be valid). I want to examine the relationships between the "core" group of closely related species and the outgroup (the most genetically distinct species), to determine the taxonomic status and validity of all species in the group.

In the metadata csv I need to create for dartR it mentions I need two special individual metrics: id (unique identifier for specimens) and pop (a label for the biological population from which the individual was drawn). 

I am aware that often people use the pop column to identify a geographic population of a particular species, to understand population genetics. 

If I am looking at species relationships within a genus, is it appropriate to put species names in the pop column? Or will this create errors in the analysis, meaning I need to label by geographic population instead?
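To make the question concrete, here is a minimal sketch of an individual metadata file with species names in the pop column. The specimen IDs, species labels and file name are hypothetical and only illustrate the layout; the only requirements taken from the description above are the two columns id and pop.

```r
# Hypothetical specimen IDs and species labels, for illustration only.
ind_metrics <- data.frame(
  id  = c("A001", "A002", "A003", "A004"),
  pop = c("sp1", "sp1", "sp2", "sp6"),
  stringsAsFactors = FALSE
)

# The two special columns must be present, and ids must be unique.
stopifnot(all(c("id", "pop") %in% names(ind_metrics)))
stopifnot(!anyDuplicated(ind_metrics$id))

# Write the metadata to a CSV file (a temporary location in this sketch).
metafile <- file.path(tempdir(), "ind_metrics.csv")
write.csv(ind_metrics, metafile, row.names = FALSE)

# Number of individuals per label in the pop column.
print(table(ind_metrics$pop))
```

The resulting file would then be supplied alongside the DArT report when the data are read in, for example through the ind.metafile argument of gl.read.dart.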

Thanks, 
Kate 

Jose Luis Mijangos

Jun 6, 2023, 11:23:45 PM
to dartR
Hi Kate,

The decision to use the species names as population depends on the analysis and the hypothesis you want to test. You would need to read the analysis methods to understand how population assignments are used in each analysis. 

For species delineation, you could use a fixed difference analysis. You can find our tutorial for this analysis here:


In this specific analysis, you would use the species names as populations. 

To change the population names easily, you can use dartR's function "gl.edit.recode.pop".
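As a rough sketch of what such a recode amounts to, the base R snippet below maps locality labels to species names through a lookup table. The locality names and the mapping are hypothetical, and this is not the implementation of gl.edit.recode.pop, only an illustration of the idea.

```r
# Hypothetical locality labels currently stored in the pop column.
old_pop <- c("siteA", "siteA", "siteB", "siteC", "siteC")

# Hypothetical lookup table: sampling locality -> putative species.
locality_to_species <- c(siteA = "sp1", siteB = "sp1", siteC = "sp2")

# Stop early if any locality is missing from the lookup table.
stopifnot(all(old_pop %in% names(locality_to_species)))

# Recode each individual's label by looking up its locality.
new_pop <- unname(locality_to_species[old_pop])
print(table(old_pop, new_pop))

# With a genlight object `gl` (an assumed name), the same recode could be
# applied through the adegenet accessor, e.g.
#   pop(gl) <- factor(unname(locality_to_species[as.character(pop(gl))]))
```

Keeping the mapping in one table makes it easy to switch between locality-level and species-level groupings for different analyses.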

Cheers,
Luis 

Kate Moffatt

Jun 7, 2023, 1:24:52 AM
to dartR
Hi Luis, 

thank you for this information. It was very helpful. 

I am hoping to create an ML tree in IQ-TREE using my filtered SNP dataset to see where species and clades are positioned in the tree. Is it therefore suitable to use species names in the pop column? As I understand it, this should not influence their positioning in the ML tree; the names simply act as labels for the biological population (i.e. species), and the tree will indicate how well resolved the clades are. Any advice would be greatly appreciated.

Thanks for the suggestion of a fixed difference analysis, this looks ideal. However, the tutorial mentions that sample sizes are recommended to be >10 and that sampling should be comprehensive across the geographic range. Unfortunately, for my group of taxa, it was incredibly hard to get lots of samples for each species and to sample across their entire geographic range.

Instead, I have 6 species, with 8 - 19 individuals per species. Would it still be appropriate to conduct a fixed difference analysis even if sample numbers and geographic sampling are not ideal, and sometimes fewer than 10 for a species?

Thanks, 
Kate

Arthur Georges

Jun 8, 2023, 9:01:56 AM
to dartR
Hi Kate,

There are differing views on species delimitation arising in part because of differing views on what is being referred to as a species. You seem to have some pre-determined putative species, and provided you are sure they are good entities (discrete units on independent evolutionary trajectories) then a phylogeny can be most informative. Substantively divergent clades (lineages) might be called species.

Coming at it from a popgen perspective, the question becomes whether there is sufficient evidence to challenge the null proposition that your "species" are one and the same. Fixed allelic differences (in number exceeding the false positive rate) between two putative taxa in sympatry yield a serious challenge to that null proposition. Slam dunk.

In parapatry, where one may assume that there is some opportunity for exchange of individuals even at a low level, or episodically, fixed allelic differences greater in number than the false positive rate also challenge the null proposition that two putative taxa are from the same species. If not substantial in number, you might have two species, but not enough evidence to demonstrate it.

In allopatry, as always, the challenge defies a definitive answer. If the number of fixed differences does not exceed the false positive rate then you have no evidence to challenge the null proposition. But if you do get substantial numbers of fixed differences the decision is subjective -- are they different enough to be considered separate species, or just geographic isolates (lineages within species).

So a fixed difference analysis can be useful.

You usually do not want to start from a preconception of the species you have at hand, but let the analysis generate the amalgamations of sample sites based on absence of fixed differences (in the knowledge that low sample sizes will not obscure fixed differences but rather inflate them through false positives) until you have a set of putative diagnosable units. If they match up with your six species all is well.
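The effect of sample size on false positives can be illustrated with a small worked calculation. This is only a sketch under stated assumptions (alleles sampled independently, diploid individuals, a biallelic locus); the function name and the example allele frequencies are hypothetical, and it is not the simulation dartR runs. It gives the probability that a locus that is not truly fixed in either population nevertheless appears as a fixed difference in the samples.

```r
# Probability that a biallelic locus appears as a fixed difference by
# sampling alone.
#   p1, p2 : true frequency of allele A in population 1 and population 2
#   n1, n2 : number of diploid individuals sampled (2n allele copies each)
# Assumes allele copies are drawn independently within each population.
apparent_fd_prob <- function(p1, p2, n1, n2) {
  only_A_in_1_only_a_in_2 <- p1^(2 * n1) * (1 - p2)^(2 * n2)
  only_a_in_1_only_A_in_2 <- (1 - p1)^(2 * n1) * p2^(2 * n2)
  only_A_in_1_only_a_in_2 + only_a_in_1_only_A_in_2
}

# Hypothetical locus with strongly differing but not fixed frequencies,
# evaluated at the smallest and largest sample sizes mentioned in the thread.
prob_n8  <- apparent_fd_prob(p1 = 0.9, p2 = 0.1, n1 = 8,  n2 = 8)
prob_n19 <- apparent_fd_prob(p1 = 0.9, p2 = 0.1, n1 = 19, n2 = 19)
print(c(n8 = prob_n8, n19 = prob_n19))
```

Under these assumptions the probability of a spurious fixed difference is markedly higher with 8 individuals per group than with 19, which is why the observed count needs to be compared against an expected false positive rate.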

If you do not have enough samples to do that, then you can work with your putative species (combining localities) and do the fixed difference analysis. No fixed differences in excess of the false positive rate, no defensible evidence of two species -- null proposition cannot be rejected. Substantive fixed differences (depending on the spatial context) can be used in argument to support different species.

If you do not have comprehensive sampling across the range of the putative species, beware also of the "separated demes on a poorly sampled cline effect" or "arbitrary slices of continuous geographic clines" -- refer to doi: 10.1093/sysbio/syz042.

A couple of our papers that might give you some more background are:


Really hope this helps and that I have not just confused you as much as I often confuse myself with this stuff.

Arthur
pop=species is fine.

Kate Moffatt

Jun 13, 2023, 12:41:15 AM
to dartR

Hi Arthur and All,

It appears I sent my response using “reply to author”, and I am not sure whether you received it. So I am writing here so that the group can see the conversation (and hopefully benefit from this query), and in case you did not receive my email.

Firstly, I would like to thank you for the detailed information and help. It is greatly appreciated and I found this information incredibly informative and helpful. I have read the attached papers (great research!) and that gave me a better understanding of fixed difference analysis along with the tutorial provided. Thank you.

I ran the fixed difference analysis with my 6 putative species and got some interesting results that I am hoping to confirm are likely correct. Two of the species (sp2 and sp5) showed no fixed differences between them (0), which was to be expected, as it is likely that my research will sink one of these species (sp5) and synonymise it with its former species (sp2). So this was great news and an interesting result. The remaining fixed differences between each pair of populations (species) appear to match my other genetic, morphological and phylogenetic research, so that is good news too.

My concern with the following analysis and my results is that all my p-values were returned as 0. This makes me question whether I have done the analysis correctly, or whether this is an error that I somehow need to fix. Could you please let me know if a p-value of 0 is possible with this analysis or unexpected? And if it is unexpected, where might I have gone wrong? Any help would be greatly appreciated.

Below is my code with the output…

 

D <- gl.fixed.diff(hwe95, v=4)

   Comparing populations for absolute fixed differences

  Monomorphic loci removed
  Populations, aggregations and sample sizes
   Sp1   sp2   sp3   sp4   sp5   sp6
     8    19    13     8    10    15

D$fd
        Sp1   sp2   sp3   sp4   sp5
Sp2     840
Sp3     931   208
Sp4     858   161   262
Sp5     893     0   221   173
Sp6    1002   586   685   606   640

D2 <- gl.collapse(D, tpop=1, verbose=3)
Starting gl.collapse
  Processing a fixed difference (fd) object with SNP data
  Comparing populations for absolute fixed differences
  Amalgamating populations with corrobrated fixed
                    differences, tpop = 1
Initial Populations
 Sp1  sp2  sp3 sp4  sp5 sp6
New population groups
Group: sp2+
[1] "sp2" "sp5"          

Sample sizes
   Sp1  sp2+   sp3   sp4   sp6
     8    29    13     8    15

D2$fd

        Sp1  sp2+   sp3   sp4
Sp2+    806
Sp3     931   191
Sp4     858   148   262
Sp6    1002   554   685   606

 

D3 <- gl.collapse(D2, tpop=1, verbose=3)
Starting gl.collapse
  Processing a fixed difference (fd) object with SNP data
  Comparing populations for absolute fixed differences
  Amalgamating populations with corrobrated fixed
                    differences, tpop = 1
Initial Populations
 Sp1 sp2+  sp3  sp4  sp6
New population groups

No further amalgamation of populations at fd <= 1
  Analysis complete

D3$fd

        Sp1  sp2+   sp3   sp4
Sp2+    806
Sp3     931   191
Sp4     858   148   262
Sp6    1002   554   685   606


D4 <- gl.fixed.diff(D3, test=TRUE, alpha=0.05, v=3, reps=1000)

Starting gl.fixed.diff
  Processing a fixed difference (fd) object with SNP data
  Comparing populations for absolute fixed differences
  Monomorphic loci removed
  Populations, aggregations and sample sizes
   Sp1  sp2+   sp3   sp4   sp6
     8    29    13     8    15
  Warning: Fixed differences can arise through sampling error if sample sizes are small
    Some sample sizes are small (N < 10, minimum in dataset = 8 )
  Comparing populations pairwise -- this may take time. Please be patient
Completed: gl.fixed.diff

D4$pval
        Sp1  sp2+   sp3   sp4   sp6
Sp1       0     0     0     0     0
Sp2+      0     0     0     0     0
Sp3       0     0     0     0     0
Sp4       0     0     0     0     0
Sp6       0     0     0     0     0
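One possible reading of a p-value of 0 can be sketched in base R. This is an assumption about how a simulation-based p-value behaves in general, not a description of how gl.fixed.diff computes it internally. If the observed number of fixed differences is far above every simulated false positive count, an empirical p-value comes out as exactly 0 (better reported as p < 1/reps), and a normal-approximation p-value is too small to display as anything other than 0. The simulated counts below are hypothetical draws; the observed count of 148 is the smallest non-zero value in the table above.

```r
set.seed(1)
reps <- 1000

# Hypothetical simulated false positive counts (NOT dartR output).
sim_false_pos <- rpois(reps, lambda = 5)

# Smallest non-zero fixed difference count reported in the thread.
observed_fd <- 148

# Empirical p-value: share of replicates at least as large as the observation.
p_emp <- mean(sim_false_pos >= observed_fd)

# A common conservative variant that can never be exactly 0.
p_cons <- (sum(sim_false_pos >= observed_fd) + 1) / (reps + 1)

# Normal approximation using the mean and sd of the simulated counts.
p_norm <- pnorm(observed_fd, mean = mean(sim_false_pos),
                sd = sd(sim_false_pos), lower.tail = FALSE)

print(c(empirical = p_emp, conservative = p_cons, normal = p_norm))
```

On this reading, a matrix of zeros would be consistent with observed counts in the hundreds far exceeding the expected false positives, rather than a sign that the analysis failed.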


Thank you very much for all your help so far

Kate
