Speedemon/Stacey collapse height


Dylan O'Hearn

Feb 1, 2023, 12:09:17 PM
to beast-users
Hello,

I'm beginning to use Speedemon and I note the advice on selecting a value of epsilon:
" The threshold ϵ describes the maximum divergence that can be tolerated before two samples are regarded as separate species. If ϵ is too large (e.g. older than the tree), then all samples will be lumped into one species. Whereas, if ϵ is too small (e.g. younger than the youngest divergence time), then all individuals will be split into their own species."

Whereas in the Stacey manual, it's described like this:
"Collapse Height is denoted ϵ  in the DISSECT paper. This is a computational approximation to zero. It has no biological meaning, and its value is simply a trade-off between speed and accuracy. Smaller is more accurate, but slower. I suggest using a value which is about 1/100 or 1/1000 of a typical species tree branch length. The value is not critical, in that a wide range of values (such as from 0.000001 to 0.0001) usually produce similar results in similar times"

Is there a difference in the two methods that makes these not comparable, or is it a matter of what's considered "accurate"?  I'm running my dataset using multiple values of epsilon to see what effect it has, but if it's as simple as saying that smaller is better, that would be pretty convenient.
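For concreteness, the STACEY manual's rule of thumb (about 1/100 or 1/1000 of a typical species tree branch length) can be turned into a pair of candidate values. This is only a minimal sketch: the function name `candidate_epsilons` and the example branch length of 0.01 are assumptions for illustration, not anything taken from STACEY or SPEEDEMON.

```python
def candidate_epsilons(typical_branch_length, divisors=(100, 1000)):
    """Return collapse heights as fixed fractions of a typical branch length.

    Hypothetical helper: the divisors follow the 1/100 and 1/1000 rule of
    thumb quoted from the STACEY manual.
    """
    if typical_branch_length <= 0:
        raise ValueError("typical_branch_length must be positive")
    return [typical_branch_length / d for d in divisors]


# Assumed example value: a typical species tree branch of 0.01 (in whatever
# units the tree uses, e.g. substitutions per site).
epsilons = candidate_epsilons(0.01)
print(epsilons)
```

Running the analysis at each of these values (and perhaps one more order of magnitude either side) is one way to see whether the results are as insensitive as the manual suggests.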

Thanks!

Jordan Douglas

Feb 2, 2023, 3:44:19 PM
to beast-users
Hi,

Yes, these two models are equivalent. These two paragraphs are different descriptions of the same parameter by different authors (myself and Graham Jones). It may be true to an extent, but I do not believe that a lower epsilon is always more accurate. For example, in our bush baby analysis in the speedemon article, we found that when epsilon=10 years, the results were nonsense because epsilon was too small relative to generation time. However, the larger point here is that the analyses appear to be quite robust to the choice of epsilon - provided that epsilon is somewhat sensible.

Jordan

Dylan O'Hearn

May 5, 2025, 9:50:00 PM
to beast-users
I am revisiting this method and I'm not entirely sure I understand its intended use. It seems like epsilon is basically a fixed, semi-arbitrary choice you can make regarding what degree of divergence justifies calling two lineages different species. I.e., if the node height of the MRCA of two lineages of interest is 5e-4, then your arbitrary decision of using either 1e-3 or 1e-4 as epsilon will give you a result of one species or two species, respectively, and testing a range of epsilon values, as recommended, tells you nothing other than that any value of epsilon below that node height will split them, and vice versa.
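The deterministic reading described here can be sketched in a few lines. This is a deliberately naive illustration, assuming a single fixed MRCA height; `delimit_pair` is a hypothetical helper, not part of Starbeast3, STACEY or SPEEDEMON, and it ignores the fact that the node height is itself estimated with uncertainty.

```python
def delimit_pair(mrca_height, epsilon):
    """Return 1 if the two lineages collapse into one cluster, else 2.

    Naive rule (an assumption for illustration): collapse when the MRCA
    node height is below epsilon.
    """
    return 1 if mrca_height < epsilon else 2


mrca_height = 5e-4  # the node height used in the example above
for eps in (1e-3, 1e-4):
    print(f"epsilon={eps:g}: {delimit_pair(mrca_height, eps)} species")
```

With a node height of 5e-4 this gives one species at epsilon = 1e-3 and two at epsilon = 1e-4, which is exactly the concern being raised.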

Is there some detail of the model, like the collapse weight, that substantially changes this?  It seems like you might as well just do an ordinary phylogenetic analysis in Starbeast3 and make the species delimitation determination based on the node height.  What is the added value of the SPEEDEMON (or DISSECT etc.) analysis?

Luke Baton

May 6, 2025, 2:21:54 PM
to beast-users
Dylan,

I'm by no means an expert, but my understanding is this...

Yes, SPEEDEMON and STACEY both use the same arbitrary threshold to determine whether or not to clump lineages together as the same "species" (but all species delimitation methods use an arbitrary threshold!). However, SPEEDEMON (and to a lesser extent STACEY) are still preferable to other current methods, as they use the most realistic models of evolution, and allow model-based support values to be inferred for the species delimitations (which most other methods do not).

In current implementations, epsilon has a single constant value, but this does not necessarily mean that the threshold used for species identification is "fixed". The nature of the threshold depends on the collapse weight used. The threshold is only absolutely fixed and deterministic if the collapse weight is fixed to a constant value of 1. Otherwise, if the collapse weight is >0 and <1, the threshold is probabilistic, and it is possible for lineages above epsilon to be clumped, and lineages below epsilon to remain separate, depending on the most likely tree topologies, branch lengths and number of species, etc. (Anyone with the relevant expertise, please contradict me, if I am wrong about this!). The collapse weight itself can either be arbitrarily defined by the user (either as a single value or a prior distribution), or empirically inferred from the data itself as a posterior distribution.

The difference between SPEEDEMON/STACEY and a conventional Starbeast3/*BEAST analysis is the exploration of "tree/species delimitation" space. Other things being equal, if the collapse weight is fixed to a single constant value of 0, then the analyses are the same (because no clumping of lineages is possible when the collapse weight is 0, as epsilon is not used). However, if the collapse weight is >0, then SPEEDEMON/STACEY will explore much larger areas of "tree/species delimitation" space than Starbeast3/*BEAST (with the latter two methods only able to explore a much narrower region of tree space for only one possible species delimitation). For this reason, the species and gene trees inferred by SPEEDEMON/STACEY might be quite different (in terms of topology, branch lengths and number of species) from those of Starbeast3/*BEAST.

So, for these reasons, simply drawing a line at epsilon across the nodes of the species (or gene) tree from a conventional Starbeast3/*BEAST analysis might give quite different results from a SPEEDEMON/STACEY analysis (and would seemingly undermine the purpose of using coalescent-based methods, which is to explicitly incorporate lineage sorting into the species delimitation analysis, not just phylogenetic inference).

Luke

Dylan O'Hearn

May 6, 2025, 5:50:38 PM
to beast-users
Thank you Luke, that is very helpful.  In my case, looking at the node height and cluster count traces, collapsing vs splitting was pretty tightly associated with the node height falling below epsilon, but it's good to know that that's somewhat due to the influence of the data rather than a strict arbitrary threshold.  I put a uniform (0,1) prior on the collapse weight but it clustered pretty strongly at a low value, suggesting the data are informative in that regard.
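One way to turn a node height trace into a support value is to count how often the sampled MRCA height falls below epsilon. The sketch below assumes the heights have already been extracted from the posterior sample after burn-in; the function name `collapse_support` and all of the numbers are made up for illustration and do not come from any real analysis.

```python
def collapse_support(mrca_heights, epsilon):
    """Fraction of posterior samples in which the MRCA height is below epsilon."""
    if not mrca_heights:
        raise ValueError("need at least one posterior sample")
    below = sum(1 for h in mrca_heights if h < epsilon)
    return below / len(mrca_heights)


# Hypothetical post-burn-in MRCA heights for one pair of lineages.
samples = [2e-5, 8e-5, 6e-4, 3e-5, 9e-4, 5e-5, 7e-5, 4e-4]
support = collapse_support(samples, epsilon=1e-4)
print(f"support for one cluster: {support:.3f}")
```

A value near 0 or 1 would indicate that the data, and not just the threshold, favour splitting or lumping; intermediate values indicate real uncertainty about the pair.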

Thanks again!

bengt....@gmail.com

May 8, 2025, 2:52:29 AM
to beast-users
Many thoughtful comments here, but just a quick note regarding epsilon. In the multispecies coalescent (MSC) model, the "species" are defined as Wright-Fisher populations (random reproduction etc). When epsilon is low enough, it approximates zero, which means that any branching event lower than this is ignored. In this way, DISSECT/STACEY avoids the problem of dealing with species trees with variable numbers of tips. It is just that branching events below epsilon are "ignored". BPP solves this in a different way: it uses reversible model jumps in the MCMC. Thus, the methods use the same model, but solve the delimitation problem in different ways. You can test different values of epsilon to assess when it is low enough. If the results stay the same as you try lower values, it is an indication that you have a good approximation of zero. See the DISSECT paper for an example. So, as the STACEY documentation says, epsilon is not arbitrary at all. Say for example that you have two minimal clusters with identical or near identical sequences. They will be considered as belonging to the same "species" in the MSC model, because the gene trees will always fit the model. It may be true that when your sequences are highly informative, every minimal cluster (e.g., individual organisms) will be its own species, but only if the MSC model is violated (which it probably is).
The authors of Speedemon suggest a different use of epsilon, that is to select a reasonable value so that it fits the taxonomic species concept of the group you study. Contrary to the definition of epsilon as an approximation of zero, this is purely arbitrary.
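The check described above (keep lowering epsilon until the results stop changing) can be sketched as follows. It assumes you have already run the analysis once per epsilon and reduced each run to a single number, for example the posterior mean cluster count; the function `stable_below`, the tolerance, and the example values are all hypothetical.

```python
def stable_below(results, tolerance=0.05):
    """Return the largest epsilon at and below which the summary stops changing.

    results maps epsilon -> a numeric summary of one run. The summary at the
    smallest epsilon is the reference; larger epsilons are accepted while
    they stay within `tolerance` of it. Returns None if no second value
    agrees with the reference.
    """
    eps_sorted = sorted(results)
    if len(eps_sorted) < 2:
        return None
    reference = results[eps_sorted[0]]
    stable = [eps_sorted[0]]
    for eps in eps_sorted[1:]:
        if abs(results[eps] - reference) > tolerance:
            break
        stable.append(eps)
    return stable[-1] if len(stable) >= 2 else None


# Hypothetical posterior mean cluster counts from four separate runs.
runs = {1e-3: 2.1, 1e-4: 3.9, 1e-5: 4.0, 1e-6: 4.0}
print(stable_below(runs))
```

In this made-up example the summary no longer changes at or below 1e-5, which would suggest that 1e-5 is already a good enough approximation of zero for that dataset.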

Luke Baton

May 8, 2025, 2:09:44 PM
to beast-users
Dear Bengt,

I think that you have misunderstood the sense in which Dylan and I were describing epsilon as "arbitrary". Of course, mathematically, as part of the algorithm used by SPEEDEMON/STACEY it is clearly defined (so this parameter is not arbitrary in this sense!). However, what epsilon defines is a mathematical fudge, which does not correspond to a known biological characteristic that causally defines real, empirical species. In addition, the particular value that epsilon takes is also arbitrary (even if it might be empirically informed, there is no theoretical justification for whatever value is used being a biologically defining characteristic of real species). This is why Douglas & Bouckaert (2022) described epsilon as "some user-defined threshold" and what they meant when they wrote with no little understatement that "it would be beneficial to have a method which explicitly estimates the species assignment function without the need for such a heuristic [i.e., epsilon]". To paraphrase, what they meant is that it would be better if the method used a biologically realistic criterion to assign species membership of lineages (rather than an arbitrary mathematical fudge with no biological meaning, which is subjectively given by the user)!! Of course, such a criterion is the Holy Grail of species delimitation studies, but no one has discovered it yet (and probably it doesn't exist using the MSC)! The MSC model has already been criticized as being inherently unable to do this (i.e., identify species boundaries biologically, without arbitrary user cut-offs) - see Sukumaran & Knowles (2017) and Leaché et al. (2019).

Luke

Bengt Oxelman

May 8, 2025, 3:47:38 PM
to beast...@googlegroups.com
Hi Luke,
I have understood perfectly what you say, but you miss my point. The whole idea of DISSECT/STACEY is to delimit MSC ‘species’, which are the branches of the species tree and mathematically well defined. I agree that they are not corresponding well to most (any?) taxonomic species circumscribed. As you say yourself, someone has to come up with a biologically explicit function for this. Maybe you mean that if such a function existed, a non-arbitrary value for epsilon would be possible to delimit species. However, we seem far from having such functions, as it seems that we cannot even define such species in words. By using epsilon as an approximation of zero, the species tree space can be explored as usual, and the delimited species will be inferred as ideal Wright-Fisher populations. Note that the MSC also makes this assumption to start with. As biologists faced with reality, we always have to make simplifying assumptions in our models; the question is whether the models are useful anyway.
If you are trying to say that the MSC is not useful, that’s fine, but it has nothing to do with whether a mathematical approximation is arbitrary. It is not.



Luke Baton

May 8, 2025, 6:39:13 PM
to beast-users
Bengt,

I think that we are talking at cross purposes here, and that we do not understand one another!

There are two different things here: (1) the MSC, and (2) species delimitation (= population/lineage assignment to taxa). You seem to be confusing them. The MSC is not at all arbitrary, but using the parameter epsilon to infer species delimitations is.

The MSC is based in population genetic theory, which describes lineage sorting between different populations. Despite its name, the MSC by itself does not allow the inference of species delimitations, because it provides no means for distinguishing intra- from interspecific populations/lineages: it can only identify independently evolving populations/lineages (not whether or not these populations/lineages belong to the same or different species). Currently, something else (i.e., external criterion/information) needs to be used to assign the populations/lineages inferred by the MSC to species...

epsilon is not part of the MSC, and is not derived from population genetic theory, nor is it derived from any theory of the process of speciation or the nature of species: it is an artificially created additional threshold parameter which is used as a cut-off to arbitrarily determine where species boundaries should be drawn, in terms of whether or not sister lineages in the "species tree" should be collapsed according to their node heights. Computationally, epsilon also happens to facilitate Bayesian analysis without the need for rjMCMC. The use of epsilon in this context is analogous to the use of thresholding in single-locus COI-based DNA barcoding (which has long been discredited).

So, when I said that a "biologically explicit function" is needed, which identifies species boundaries, I did not mean so that it could provide a non-arbitrary value for epsilon! I meant so that it could REPLACE epsilon, with a function based in population genetic and/or speciation theory (rather than using node height as a threshold, which has no theoretical justification as a species boundary)!! (Although that extra biological realism might screw-up the use of epsilon as a computational strategy to avoid having to use rjMCMC). However, it isn't currently obvious that such a "biologically explicit function" exists (as you suggest). epsilon might be the best we can do, but that doesn't stop it being arbitrary.

I did not say that the MSC is not useful, and I completely disagree with you when you said that MSC-based species delimitation is not "corresponding well to most (any?) taxonomic species circumscribed". It depends on the data-set, and the value of epsilon chosen (but the latter is the rub).

Omar Idris

May 8, 2025, 9:27:09 PM
to beast...@googlegroups.com
Whichever method you choose, no single method is sufficient, although some are better than others. In my opinion, an integrated approach is best, and your choice of epsilon should maximize agreement with the results of that more integrated approach. Whether your speciation model is allopatric or isolation with gene flow, you should have additional evidence so that you are not choosing a threshold arbitrarily. Until then, every species delimitation is just an expert opinion or a hypothesis!

ONI



Bengt Oxelman

May 9, 2025, 11:57:02 AM
to beast...@googlegroups.com

Luke,

Don’t worry, I do understand you, believe me. I agree with most of what you say, and using epsilon to inform species delimitation is a bad idea and was never the intention. This is based on a misinterpretation of what epsilon is, so I will spend some text trying to explain it.

Coalescent theory models allele histories in populations, and in its simplest form assumes no recombination, no natural selection, and no gene flow or population structure. Rannala and Yang (2003) connected such populations in a tree. This model is now known as the multispecies coalescent (MSC) model. The choice of the word “multispecies” is unfortunate, and likely an important source of the confusion in the discussion here, as well as elsewhere, because what it really does is connect multiple ideal populations in a tree, not what most biologists think of when they use the term ‘species’. As we all know, there are many concepts of species, and there is also disagreement about whether ‘species’ is a real biological entity with unique properties or not. But at least it is a taxonomic rank, taxonomic species are given binomial names, and species may have legal implications. Some species (Homo sapiens) are paradigmatic, whereas others are controversial. For example, despite the title of his most well-known book, Darwin considered “… species as one arbitrarily given for the sake of convenience to a set of individuals closely resembling each other, and that it does not essentially differ from the term variety”. In contrast, both Linnaeus and Ernst Mayr considered species to be real units, created by God in the former case and the basic units of evolution in the latter. Much of modern biology is highly influenced by the Mayrian concept, but nevertheless, a unified and explicit universal model that biologists can use with their data to test hypotheses is lacking, and even on the conceptual side, there is a lot of disagreement (even if the ‘lineage’ concept sensu de Queiroz and others is gaining acceptance, it lacks operational criteria).

So, we are faced with a semantic problem here. Although we agree that the MSC models populations, the inclusion of the term ‘species’ has caused confusion. Say we have a set of alleles and assume that these alleles have evolved according to the coalescent assumptions: are these alleles sampled from one or more such populations? We can use the MSC to compare the likelihood of our data under different models here, for example using model selection methodology such as the likelihood ratio test. In a Bayesian framework, reversible-jump MCMC can be performed, where branches are collapsed and expanded. In a simple case, we may just compare the cases of one population (‘species’) versus two. The likelihood of the “two” case will also be affected by the depth of the split. In DISSECT/STACEY/Speedemon the number of tips is kept constant, which has some mathematical/computational advantages. This is implemented by using a modified branching model. Instead of the usual birth (or birth/death) model, in which the probability of branching events is a priori assumed to be constant across the branches, very high prior probabilities are assigned to very shallow branching events (defined by epsilon), so shallow that they can be assumed to approximate zero depth. So, if our sampled alleles are informative enough about deeper branching events, we will conclude that we have two populations (‘species’). If our data are not informative, then we conclude that we have one (actually, it would be more correct to say that we do not have evidence for more than one, and that the statistical power is low). Note though that for computational reasons, there must be a measurable split height, which is defined by epsilon, and as is clearly stated in the STACEY/DISSECT documentation, this should be as small as possible. This is the intended use of epsilon, and IT IS NOT ARBITRARY in this sense. It is zero split height (approximately). I hope that I have made myself clear now.
Conversely, you are absolutely right that using it as an arbitrary threshold is not qualitatively different from single-locus thresholds, gdi, or what have you.
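The modified branching prior described above can be pictured as a spike-and-slab mixture on node height. The sketch below is a toy only: the exponential "slab" stands in for the birth-death density and is an assumption, as are the function names, the `rate` parameter and the example numbers. It is not the formula from the DISSECT paper or the STACEY/SPEEDEMON code.

```python
import math


def spike_and_slab_density(height, epsilon, weight, rate):
    """Toy prior density for a node height.

    weight: prior mass placed on the spike, uniform on [0, epsilon).
    rate:   rate of the exponential slab (a stand-in, by assumption, for
            the usual birth-death density).
    """
    if height < 0:
        return 0.0
    spike = (1.0 / epsilon) if height < epsilon else 0.0
    slab = rate * math.exp(-rate * height)
    return weight * spike + (1.0 - weight) * slab


def prior_prob_collapsed(epsilon, weight, rate):
    """Prior probability that a node height falls below epsilon (toy mixture)."""
    slab_mass = 1.0 - math.exp(-rate * epsilon)
    return weight + (1.0 - weight) * slab_mass


# Hypothetical numbers: epsilon = 1e-4, slab rate = 100.
for w in (0.0, 0.5, 0.9):
    print(w, prior_prob_collapsed(1e-4, w, 100.0))
```

With a weight of 0 the spike vanishes and the density is just the slab; as the weight grows, more prior mass sits on splits shallower than epsilon, so only informative data can pull a node above it.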

 

So why should you use DISSECT/STACEY/Speedemon if you don’t want to delimit ideal populations? I think the main advantage is that you can infer multilocus phylogenies under fully parameterized stochastic models without a priori knowledge of which populations your samples should be assigned to. It should be obvious that using traditional taxonomic species delimitations would usually be a bad idea, as those are usually expected to be more inclusive than local, perfect or near-perfect populations. But if you define species as the branches of an MSC tree, you can certainly also delimit them with empirical data. The future will tell if we can model the branches more like what we perceive species to be, or if they are better viewed as clades in the phylogenetic tree framework, or something more elaborate.

So, in summary, epsilon should be used as an approximation of zero, and even if it technically can be used as an arbitrary threshold (which would not be an approximation of zero), such a threshold has no theoretical justification. It seems to me that you and others have interpreted “species delimitation” as something which should be applied to real Mayrian entities in nature. Personally, I lean towards the Darwinian interpretation of the word, a convenient taxonomic rank among others.

Luke Baton

May 9, 2025, 4:50:14 PM
to beast-users
Thanks for taking the trouble to explain things, Bengt.

OK, we agree about, and have the same understanding of, many things (except what species are!). I'm open-minded about what species are, but for the moment I am a species realist (at least for dioecious organisms using Mayr's BSC).

However, there are some things that I still do not understand about what you are saying with regard to the purpose of epsilon. I get what you are saying about epsilon not being arbitrary because it should be as low as possible, but I do not understand how this fits with the purpose of DISSECT/STACEY as a species delimitation method. I will re-read the DISSECT/STACEY/SPEEDEMON articles over the weekend, and then, if you will be kind enough to answer, ask you some questions in order to clarify the things that I still do not understand.
