Hello,
I am running BioGeoBEARS to determine the deep biogeography of a large family of sedentary organisms with a worldwide distribution. The family comprises over 1000 species. We have inferred its deep phylogeny using one specimen per genus for 77 genera, covering almost the entire genus-level diversity of the family. We have defined large landmasses as the biogeographical areas for coding, and 70 out of 77 genera live in only one area.
I am aware that coding genera instead of species is not recommended (http://phylo.wikidot.com/biogeobears-mistakes-to-avoid#genus_trees), but coding species is not ideal either, because it will miss some biogeographic events and therefore tell a simpler story.
In practice, we ran the analyses with the tips coded in two ways: with the distribution of the sequenced species and with the distribution of the whole genus. DEC+J was the best-fitting model in both cases. I am attaching the results. As you can see, the results are very similar, but uncertainty at some deep nodes is much higher under genus coding than under species coding, which leaves the biogeographic origin of the family unresolved. This happens mainly because the model assigns appreciable probability to some of these ancestors occupying more than one area.
Given that none of the >1000 known species of the family currently occupies more than one area, that the areas are separated by oceans, and that the group is not very old, I think it is unlikely that any ancestor lived in more than one area. Because the model assumes that the terminals are species, do you think that coding the tips with genus distributions pushes BioGeoBEARS to assign probability to ancestors living in more than one area, and so increases the uncertainty of our analyses?
In this context, would it make sense to report the results of both the species-coding and genus-coding analyses in the article, and to focus our conclusions on the parts of the reconstruction where the two agree? Or do you think that one of these configurations is clearly inappropriate and should not be given credence?
A separate concern is that the model appears to assume that all species in the group are sampled. Our analyses severely violate this assumption, given that we sampled only 77 of the >1000 species in the family. In this context, does it make sense to run these biogeographic analyses at all? It is worth mentioning that we are interested only in the range probabilities of the ancestors, and in BSM to determine the main biogeographic events, not specifically in the values of the estimated parameters (dispersal, extinction, vicariance, etc.). Is very incomplete taxon sampling likely to affect the ancestral range probabilities, or only the estimated parameter values?
If, apart from your response, you can offer some published references comparing or discussing these issues, that would be great.
Thanks in advance,
David