Genus-level biogeographic analyses and incomplete sampling

David Ortiz

Jul 23, 2024, 5:57:34 AM

to BioGeoBEARS

Hello,

I am running BioGeoBEARS to determine the deep biogeography of a large family of sedentary organisms with a worldwide distribution. The family comprises over 1000 species. We have inferred its deep phylogeny using one specimen per genus for 77 genera, covering almost the entire genus-level diversity of the family. We have defined large landmasses as the biogeographical areas for coding, and 70 out of 77 genera live in only one area.

I am aware that coding genera instead of species is not recommended (http://phylo.wikidot.com/biogeobears-mistakes-to-avoid#genus_trees), but coding only the sampled species is not ideal either: it misses some biogeographic events and will therefore tell a simpler story than the real one.

In practice, we have run analyses coding the tips with both the distribution of the sequenced species and the distribution of the whole genus. DEC+J was the best-fitting model in both cases. I am attaching the results. As you can see, the results are very similar, but uncertainty at some deep nodes is much higher in the genus approach than in the species approach, leaving the biogeographic origin of the family unresolved. This happens mainly because the model gives credibility to some of these ancestors living in more than one area.

Given that none of the known >1000 species of the family occupies more than one area currently, the areas are separated by oceans, and the phylogenetic history of the group is not that old, I think it is unlikely that any ancestor lived in more than one area. Do you think that by coding in the terminals the distribution of genera, and because the model assumes that the terminals are species, we are pushing BioGeoBEARS to give credibility to ancestors living in more than one area, increasing the uncertainty of our analyses?

In this context, would it make sense to report the results of both the species-coding and genus-coding analyses in the article and focus our conclusions on the parts of the reconstruction where they agree, or do you think that one of these configurations is clearly inappropriate and shouldn't be given credibility?

On the other hand, it is also apparent that the model assumes that all species in the group are sampled. This assumption is severely violated in our analyses, given that we only sampled 77 out of >1000 species in the family. In this context, does it make sense to run biogeographic analyses at all? It is worth mentioning that we are only interested in the range probabilities of the ancestors, and in BSM to determine the main biogeographic events, not specifically in the values of the estimated parameters (dispersal, extinction, vicariance, etc.). Is very incomplete taxon sampling likely to affect the range probabilities of the ancestors, or just the values of the estimated parameters?

If, apart from your response, you can offer some published references comparing or discussing these issues, that would be great.

Thanks in advance,

David

11-MCMCTREE_DEC+j_species.png

11-MCMCTREE_DEC+j_genus.png

Ivan Magalhães

Jul 23, 2024, 6:50:39 PM

to bioge...@googlegroups.com

Hi David, two thoughts on this:

"I think it is unlikely that any ancestor lived in more than one area."

You can test this by forcing ranges to be composed of no more than two areas by changing the parameter BioGeoBEARS_run_object$max_range_size.

(I am not sure you can set this to 1; perhaps you can, but in that case it would only make sense if all dispersal events happen at the nodes, as founder-event speciation events, rather than along branches, since anagenetic dispersal necessarily passes through a 2-area range.) (Also, it would preclude you from having terminals that occur in more areas than max_range_size!)
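To make the effect of max_range_size concrete, here is a minimal Python sketch (not BioGeoBEARS code) that enumerates the geographic ranges allowed under a given cap. The six area labels are hypothetical, since the thread does not say how many areas were coded, and counting the null range as a state is an assumption that can be switched off with the flag.

```python
from itertools import combinations

def allowed_ranges(areas, max_range_size, include_null=True):
    """List every range (set of areas) with at most max_range_size areas."""
    # The empty string stands for the null range (lineage present nowhere).
    ranges = [""] if include_null else []
    for size in range(1, max_range_size + 1):
        for combo in combinations(areas, size):
            ranges.append("".join(combo))
    return ranges

# Hypothetical area labels; the real analysis may use a different number.
areas = ["A", "B", "C", "D", "E", "F"]

for cap in (1, 2, 3):
    states = allowed_ranges(areas, cap)
    print(f"max_range_size={cap}: {len(states)} states")
```

With a cap of 1 the only non-null states are the single areas, so no two-area range exists for a lineage to pass through along a branch, which is the constraint described above.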

Regarding taxon sampling incompleteness, if the missing taxa represent otherwise unsampled lineages that would bring novel biogeographical information, this will surely impact the ancestral range estimation (in addition to the dispersal and extinction parameters). I am not sure about solving this by coding areas for each genus -- this sits on the premise that the genera are monophyletic, at least. So I'd think twice before doing this.

Please note that in ancestral range estimation as done in BioGeoBEARS and similar programs (Lagrange, DIVA, etc.), it is not uncommon to observe widespread ancestors, because vicariant/allopatric events have cost 0 for the model (as opposed to dispersal and extinction events, which are penalized). So, solutions in which a widespread ancestor gave rise to endemic descendants through allopatric events are usually favored.
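As a rough illustration of why such solutions require a widespread ancestor, the Python sketch below lists every way a range can be divided into two non-empty, disjoint daughter ranges. It is a simplification for intuition only: the function name is hypothetical, and the cladogenesis models in BioGeoBEARS and similar programs may restrict or weight these splits differently.

```python
from itertools import combinations

def vicariance_splits(ancestor):
    """All unordered splits of a range into two non-empty, disjoint daughter ranges."""
    areas = sorted(ancestor)
    first, rest = areas[0], areas[1:]
    splits = []
    # Keep the first area on the left side so each unordered split appears once.
    # The left side never takes all remaining areas, so the right side is never empty.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = (first,) + extra
            right = tuple(a for a in areas if a not in left)
            splits.append(("".join(left), "".join(right)))
    return splits

for ancestor in ("A", "AB", "ABC"):
    print(ancestor, vicariance_splits(ancestor))
```

A single-area ancestor has no possible split, so every allopatric event in a reconstruction implies an ancestor occupying at least two areas at that node.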

Best,

Ivan

--
You received this message because you are subscribed to the Google Groups "BioGeoBEARS" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biogeobears...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/biogeobears/277fa0f0-c41e-42b4-825a-92c3571cff80n%40googlegroups.com.

Julia Dupin

Jul 23, 2024, 6:50:39 PM

to bioge...@googlegroups.com

Hi David,

Reading your email, I have a few questions that might help you decide whether the phylogeny you have can indeed be used for a biogeographic analysis.

My main question is: what question or hypothesis (or hypotheses) do you actually want to answer or test with this analysis? You mention that "[you] are just interested in the distribution probabilities of all ancestors in our analyses, and in BSM to determine the main biogeographic events", but determining what is actually hypothesized here would make the study less descriptive and more straightforward. If a specific ancestral range or directionality is expected, BioGeoBEARS would allow you to test that, for example.

My other main question is whether there is any chance of adding tips to this phylogeny. Not up to 1000, of course (what a dream that would be), but enough that genera are better represented in proportion to one another.
Indeed, it seems the current phylogeny was built with a different purpose in mind ("We have inferred its deep phylogeny using one specimen per genus for 77 genera"), so it might be ill-suited to answering biogeographic questions.

Please let us know a bit more about your study.

Julia

--

Julia Dupin, PhD

David Ortiz

Jul 24, 2024, 6:36:16 AM

to BioGeoBEARS

Hello Ivan,

Thanks for your reply. Responses below.

On Wednesday, July 24, 2024 at 12:50:39 AM UTC+2 ilf.ma...@gmail.com wrote:

Hi David, two thoughts on this:

"I think it is unlikely that any ancestor lived in more than one area."
You can test this by forcing ranges to be composed of no more than two areas by changing the parameter BioGeoBEARS_run_object$max_range_size.
(I am not sure you can set this to 1; perhaps you can, but in that case it would only make sense if all dispersal events happen at the nodes, as founder-event speciation events, rather than along branches, since anagenetic dispersal necessarily passes through a 2-area range.) (Also, it would preclude you from having terminals that occur in more areas than max_range_size!)

*** Actually, the analyses attached to the original email were done without enforcing a strict limit on the number of areas an ancestor could occupy (I set it to 3 because we had to specify a number). If you look at the analysis coding terminals as sampled species (not genera), you'll see that the probability of any ancestor occupying two or more regions is very low, even without enforcing it. Most of the events are inferred as cladogenetic events without range changes, and range changes are usually estimated as jump dispersal events. I mentioned in the email that I believe it is unlikely that any ancestor lived in two areas (given the current distribution of the species in the family) to point out that by coding terminals as genera (only 7 out of 77 genera currently occupy more than one area), we might be pushing the genus-based models to infer ancestors in multiple areas.

Regarding taxon sampling incompleteness, if the missing taxa represent otherwise unsampled lineages that would bring novel biogeographical information, this will surely impact the ancestral range estimation (in addition to the dispersal and extinction parameters). I am not sure about solving this by coding areas for each genus -- this sits on the premise that the genera are monophyletic, at least. So I'd think twice before doing this.

*** Yes, the BioGeoBEARS wiki help page actually suggests coding the sampled species rather than the genera as one of the potential solutions (http://phylo.wikidot.com/biogeobears-mistakes-to-avoid#genus_trees), so we plan to take the species-based analyses as the main results. However, given that the genus-level analyses provide very similar estimates for most nodes, the question is whether it makes sense to report and discuss them as well. If there is a good reason why genus-coded results are meaningless, then there is no point in reporting them.

Please note that in ancestral range estimation as done in BioGeoBEARS and similar programs (Lagrange, DIVA, etc.), it is not uncommon to observe widespread ancestors, because vicariant/allopatric events have cost 0 for the model (as opposed to dispersal and extinction events, which are penalized). So, solutions in which a widespread ancestor gave rise to endemic descendants through allopatric events are usually favored.

Best,
Ivan

Regards

David Ortiz

Jul 24, 2024, 6:39:16 AM

to BioGeoBEARS

Hi Julia:

Thanks a lot for your answer.

Actually, there is a fairly limited number of biogeographic events in the history of this family (images attached to the previous message). The events are spread across the biogeographic history of the group, and they involve different areas, so no clear patterns are apparent. Instead, it seems that the events were sporadic. Therefore, on the hypothesis-testing side, we did not find anything very promising.
However, this is the first estimate of the diversification timeline of the family, and each of the biogeographic events had a profound impact on the diversification of the group, leading to radiations in terms of genera and species (which we cannot test appropriately given that our sampling is one species per genus). Therefore, our main biogeographic result is likely to be how continental drift shaped the group's evolutionary history.

Concerning your other question, unfortunately, denser sampling is out of reach for now. This is a UCE-based phylogeny, and there are not even Sanger marker data available for adding other species.

Cheers,
