*BEAST convergence problems

Jim McGuire

Apr 25, 2012, 5:58:06 PM

to beast...@googlegroups.com

Hi All,

I have been running *BEAST analyses with fossil calibrations on a data set of 134 exemplars representing 45 species (at least two individuals for all but one ingroup taxon). The data set includes 9 loci with virtually complete coverage for each exemplar. I have struggled to obtain proper convergence with this data set and have tried all sorts of adjustments to resolve the problem, to no avail. I have been running these analyses for 1 billion generations.

I am using two calibrations, and I have set up analyses with both calibrations (one young, one old), with just the old or just the young calibration, and with no calibration at all. I am using a uniform prior with a range of 24-36 million years for the deeper calibration, and a lognormal distribution with a mean in real time of 2.5 million years for the younger calibration.

The problems that I have encountered include (1) posteriors that decline in value over time, with the shape of the trace closely matching those of the prior and the species.coalescent, and (2) recovered species tree topologies that actually make good sense, except that the calibrated nodes are highly compressed toward the tips (as if there is almost no divergence among the members of the calibrated clade), such that the scale for all other branches on the tree is gigantic. For example, in calibrated analyses in which the older calibration node has a prior distribution with a mean of 30 million years, the chronogram will reflect this age for this node, but the rest of the nodes, which appear much deeper in the tree, necessarily have really old node ages. In this case, the root of the tree will have a divergence estimate of 5 billion years, etc. The bizarre compressed-node phenomenon only occurs for nodes that are tied to a calibration. In other analyses, in which the node in question is not used for calibration, the branch lengths within the group are much longer, as expected.
Also, in an earlier batch of analyses with a slightly less complete version of this data set, the posterior probability trace climbed throughout the analysis, with the branch lengths associated with the calibration nodes starting off compressed and then growing progressively over the course of the analysis. Unfortunately, in those analyses, 500 million generations wasn't enough to allow the tree shape to stabilize.
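
For readers setting up a similar lognormal calibration, here is a minimal sketch of how a mean given in real space maps onto the log-scale parameters of the distribution. It assumes that "mean in real time" refers to the mean of the distribution on the untransformed scale, and it uses a log-scale standard deviation of 0.5 purely as a placeholder, since the post does not state the value (or any offset) that was actually used. The function names are illustrative, not part of BEAST.

```python
import math
from statistics import NormalDist

def lognormal_log_params(real_mean, sigma):
    """Return (mu, sigma) of the underlying normal for a lognormal
    whose mean on the untransformed scale is real_mean."""
    if real_mean <= 0 or sigma <= 0:
        raise ValueError("real_mean and sigma must be positive")
    # E[X] = exp(mu + sigma^2 / 2)  =>  mu = ln(E[X]) - sigma^2 / 2
    mu = math.log(real_mean) - 0.5 * sigma ** 2
    return mu, sigma

def lognormal_quantile(mu, sigma, p):
    """Quantile of the lognormal via the standard normal inverse CDF."""
    return math.exp(mu + sigma * NormalDist().inv_cdf(p))

# 2.5 (million years) is from the post; sigma = 0.5 is a hypothetical placeholder.
mu, sigma = lognormal_log_params(2.5, 0.5)
lo = lognormal_quantile(mu, sigma, 0.025)
med = lognormal_quantile(mu, sigma, 0.5)
hi = lognormal_quantile(mu, sigma, 0.975)
print(f"mu={mu:.4f}  median={med:.3f}  95% interval=({lo:.3f}, {hi:.3f})")
```

Note that the median of a lognormal sits below its real-space mean, so it is worth checking which of the two a given calibration is meant to pin down.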

I have run analyses on each of these permutations of the data set (with different calibrations) with empty alignments. I wasn't able to see anything in these analyses that jumped out at me as an explanation for my bizarre results, but this may simply be an issue of my not knowing what I should be looking for.
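
One thing that could be checked in an empty-alignment (prior-only) run is whether the sampled age of each calibrated node actually follows the calibration that was specified, and how the root age scales relative to it. The sketch below assumes the log is tab-delimited with `#` comment lines, as BEAST log files usually are; the column names `calib.old` and `root.height` are hypothetical, and the toy log text is invented only to exercise the functions.

```python
import csv

def read_trace(log_text, column, burnin_fraction=0.1):
    """Return post-burn-in samples of one column from tab-delimited log text."""
    # Skip blank lines and '#' comment lines that precede the header row.
    rows = [ln for ln in log_text.splitlines()
            if ln.strip() and not ln.startswith("#")]
    reader = csv.DictReader(rows, delimiter="\t")
    values = [float(row[column]) for row in reader]
    return values[int(len(values) * burnin_fraction):]

def fraction_within(values, lower, upper):
    """Fraction of samples inside the closed interval [lower, upper]."""
    return sum(lower <= v <= upper for v in values) / len(values)

# Toy log text, invented for illustration; not real *BEAST output.
demo_log = (
    "# toy comment line\n"
    "state\tcalib.old\troot.height\n"
    "0\t30.0\t60.0\n"
    "1000\t25.0\t50.0\n"
    "2000\t40.0\t80.0\n"
    "3000\t35.0\t70.0\n"
)

node_ages = read_trace(demo_log, "calib.old", burnin_fraction=0.0)
root_ages = read_trace(demo_log, "root.height", burnin_fraction=0.0)

# Share of prior samples that respect the 24-36 My uniform calibration.
inside = fraction_within(node_ages, 24.0, 36.0)
# Root age relative to the calibrated node, sample by sample.
ratios = [root / node for root, node in zip(root_ages, node_ages)]
print(f"inside bounds: {inside:.2f}, max root/node ratio: {max(ratios):.1f}")
```

A root-to-calibrated-node ratio in the hundreds under the prior alone (as with a 5-billion-year root over a 30-million-year node) would suggest the compression is coming from the prior setup and not from the sequence data.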

I've been wondering if this might have something to do with an interaction between population size estimates and branch lengths, so I have tried running analyses using the 1/x prior for species.pop.mean as well as an inverse gamma (which solved a colleague's convergence problems). This appears not to have made much difference, although the invgamma analyses are still running.

Any advice would be greatly appreciated!

Jim

Jim McGuire

May 7, 2012, 12:51:43 PM

to beast...@googlegroups.com

Hi everyone,

I'm not sure if the problem that I identified in my prior post is a general one, especially since I received no posts in response (but thanks to the few of you who sent me direct comments/condolences). I managed to solve my problem and I figured I should share this info just in case I am not alone in experiencing this issue. To quickly restate the problem, I was running *BEAST analyses and I found that whenever I attached a calibration to a particular node, that node would wind up highly compressed on the inferred species tree (in other words, the entire clade was inferred to be much younger than expected relative to all or most other inferred clades). Consequently, whereas I might expect a calibrated node to be 30 million years old and the root of the tree to be 60 million years old, the calibrated node would have an age of 30 million years as expected, but the root node would have an inferred age of 500 million years (or more), and the same inflation would apply to all of the other non-calibrated nodes as well. Also, the posterior trace would climb initially, but then fall dramatically, after which it might or might not stabilize.

One thing I neglected to mention in my prior post was that this data set will only run if I employ realistic starting trees (otherwise I get the typical likelihood=-inf error). I was using a set of starting trees for each gene as well as a starting species tree. It now appears that this unexpected/pathological behavior was caused by some sort of interaction between the calibration prior and the starting gene trees that I was using (even though the branch lengths were reasonable approximations for the overarching species tree). Once I deleted the starting gene trees (only the starting species tree was essential) and just allowed the starting gene trees to be selected randomly under the coalescent process, both problems that I described above were resolved. I might add that, for those of you who are just getting started with *BEAST on larger data sets, you should expect these analyses to take a long time to converge even when the analysis is working properly. All of my analyses take many tens of millions (sometimes hundreds of millions) of generations just to find a reasonable approximation of the species tree and escape the "spaghetti tree zone" (the negative branch length zone), and thereafter it usually takes additional hundreds of millions of generations for the node ages to converge. My current posterior traces reflect a slow "uphill" slog. Perhaps this is something that can be resolved with better operator tuning, etc., but after months of fooling around with these data trying to get acceptable performance, I'm happy/thrilled to wait it out.
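
For anyone wanting a quick numerical check of whether a trace is still on that kind of "uphill" slog, a rough sketch follows. It is not the estimator used by Tracer or any BEAST tool; it is a simple autocorrelation-based effective sample size plus a comparison of the first and second halves of the trace, and both function names are made up for illustration.

```python
import statistics

def effective_sample_size(trace):
    """Crude ESS: n / (1 + 2 * sum of positive-lag autocorrelations),
    truncating the sum at the first non-positive autocorrelation."""
    n = len(trace)
    if n < 3:
        raise ValueError("trace is too short")
    mean = statistics.fmean(trace)
    centered = [x - mean for x in trace]
    var = sum(c * c for c in centered) / n
    if var == 0:
        return float(n)
    tau = 1.0
    for lag in range(1, n):
        acov = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / n
        rho = acov / var
        if rho <= 0:
            break
        tau += 2.0 * rho
    return n / tau

def half_mean_shift(trace):
    """Difference between second-half and first-half means,
    in units of the whole-trace standard deviation."""
    half = len(trace) // 2
    spread = statistics.pstdev(trace)
    if spread == 0:
        return 0.0
    return (statistics.fmean(trace[half:]) - statistics.fmean(trace[:half])) / spread

# Synthetic traces for illustration only: a steady climb and a well-mixed one.
climbing = [float(i) for i in range(100)]
mixing = [0.0, 1.0] * 50
print(effective_sample_size(climbing), half_mean_shift(climbing))
print(effective_sample_size(mixing), half_mean_shift(mixing))
```

A trace that is still climbing shows a very small ESS and a large positive shift between halves, whereas a stationary, well-mixed trace shows a shift near zero.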

Finally, as I mentioned before, neither of the problems described above applies to my non-calibrated analyses, which have really beautiful convergence properties for all parameters.

Anyhow, if this is a problem unique to my own data set, I apologize for cluttering up the user-group.