Convergence problems

478 views
Skip to first unread message

samw...@gmail.com

unread,
Jan 12, 2021, 12:12:24 PM1/12/21
to beast-users
Dear BEAST community

I am struggling to get my BEAST analyses to converge and was wondering if somebody has a pointer for me how to best diagnose/solve these issues.

I have ddRADseq sequence data from 161 bottlenose dolphins from two, potentially three, species from around Australia with a total alignment length of 2.2 million bp (~35700 variant sites) which I'm analyzing as a single partition.

For the past set of runs, I have set one lognormal MRCA prior on all samples at around 2.3 million years ago based on estimates from the literature and am using a GTR+G substitution model with 4 rate categories and empirical base frequencies. As a tree prior I'm using coalescent constant population with default settings for those priors. While a single analysis usually does converge for all parameters, independent analyses converge at different values for posterior, likelihood and treeLikelihood when run for chain lengths of 50'000'000 and also in longer runs of 100'000'000. I have also specified a lognormal prior for the strict molecular clock rate at around 0.0014 based on values obtained from the literature on cetaceans.

The topology of the trees obtained does make complete biological sense and is largely consistent across runs, however, the 95HPD of the node age for the MRCA is very large.
In an effort to improve this, I have run the same analysis with the CoupledChains module and running two chains. It does appear to reach convergence slightly earlier like this, however, it looks like I'm about to face the same convergence issue here although the run hasn't finished yet. See attached Tracer screenshot.

Specifically, my questions are as follows:
- One part I am very confused about are the tree prior. Is coalescent constant population appropriate for something like this? Can someone give or point me towards guidance as to how to decide on which tree prior to use?
- Given that the 95HPD around the MRCA is massive I am wondering if I misspecified this prior. Currently, I have it specified as a lognormal at 2.3 (mean in real space) with S 0.6. Should I maybe specify this as a uniform distribution instead?
- could this be improved by adjusting some of the operators? If yes, which operators would this be?
- the trace in the attached screenshot looks very different from what I obtained when running without CoupledChains with these downward outliers in posterior, likelihood and treeLikelihood. Is this normal or is this something to worry about (I suspect the latter)

Thank you very much in advance and best wishes
Sam
tracer.png

aditi sharma

unread,
Jan 12, 2021, 2:37:06 PM1/12/21
to beast-users
Sam,
     From what I understand of your problem here is what might help
1. Before you put your data on Beast, run it through Tempest. It helps a great deal. Create a ML tree and open it with tempest-remove the outliers. For details-check Tempest manual.  
2. Run your data for longer mcmc. I usually run a 100 million chains.
3. If you are running 3 different species in one dataset then convergence issues are normal. But with a good dataset (tight residuals) you can get the desired results.

Hope that helps!

Carlo Pacioni

unread,
Jan 14, 2021, 6:49:44 AM1/14/21
to beast...@googlegroups.com
Hi Sam,
by using the coalescent costant pop size tree prior, you are making the prior assumption that the population size of the organism that you are trying to model has been constant over time AND that the samples you have collected are from a panmictic population (i.e. the same pop), which clearly is not true as you say that you have samples from three different species. While there are cases where it might be legitimate to use coalescent tree priors and, in my experience, the tree topology doesn't change much, I'd argue that you have to have a good reason to go ahead with such an approach. So really comes down to what questions you are trying to answer. If you are after some kind of phylogeographic analysis, this may be acceptable, but you could also consider a structured coalescent with each species as one deme. If you are interested in looking at the degree of separation and divergence time between these species, a speciation prior (Birth-Death) may be also an option. If you are interested in the demographic status of each of these species (although it doesn't sound that this is the case), then there are better approaches. 
That being said, I'd also consider whether it is really necessary to link all your fragments into one partition as this may lead to inaccurate results since most likely these are actually independent loci. As I suspect you are interested in phylogenetic analysis you may want to consider using a species tree approach. This will probably take a very long time to converge, but you could consider dropping the fragments that only have 1 or 2 SNPs, since these contribute very little information.

Keep in mind that there may be interactions between the tree prior and your tree calibration (see Drummond and Suchard 2008) and it may be worthwhile to run an analysis sampling from the priors to make sure that the calibration reflects what you wanted. Also, in absence of heterochronus data, your calibration on the root and prior on the clock are the only source of information for the analysis. I'm not sure what you intend with very wide HPD, but your calibration on the node would allow the root to move up to ~6M years ago. Is this what you are seeing?
The prior you use should reflect the knowledge you have and most of the time a lognormal calibration on an internal node is appropriate because we usually think to know when something happened (higher prob mass around the mean), but we want to allow for some uncertainty, so I wouldn't change it to another prior (e.g. uniform) unless you have good reason to do so (e.g. you came across new information).

There is no fix/magic number for how long a MCMC should be, that changes for each dataset and type of analysis combination, but you certainly have a big dataset and I wouldn't be surprised if you need much longer runs to get convergence (from the plot you posted both runs are too short).

you don't say whether you are using BEAST1 or BEAST2, but both allow for tuning of operators. BEAST2 will spit out some suggestions to this end if it detects the need from some manual adjustments, but BEAST1 also reports an analysis of the operators that you can look at to decide whether some need adjustments. Most of the time the autotuning that BEAST has inbuilt does a pretty good job, so I would first worry about making sure you have a proper set up before playing around with the operators, but happy to give you some pointers in another email if you think you need to adjust them.

Good luck,
carlo



--
You received this message because you are subscribed to the Google Groups "beast-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beast-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/beast-users/3191219b-ca14-4312-af89-c759a5998d2dn%40googlegroups.com.

samw...@gmail.com

unread,
Jan 18, 2021, 1:01:18 PM1/18/21
to beast-users
Dear Carlo, dear Aditi

Thank you very much for your extensive feedback! I am using BEAST2 and I must admit I was not aware that the coalescent tree prior assumes a panmictic population. I have now started two analyses with a birth-death prior and as far as I can tell thus far, they appear to be converging to similar values. However, the runs are not finished yet (~20 million out of 100 million) so I'll see in the coming days if this improved the run and if I need to make them longer. Currently, I'm again running it as one partition to see if it improves the results, I will try a partitioned analysis right after this.

The calibrated node quite consistently recovers the split time from the literature (all runs I've done thus far end up somewhere around 2 million years for the MRCA) for that node. As for sampling from the prior I am uncertain what actually is expected: Do I expect to recover the distribution I specified in the prior if there is no interaction between tree prior and calibration?

Aditi, I am unfamiliar with Tempest. Thank you for the recommendation, I'm reading up on this now!

Cheers
Sam

sehrish kakakhel

unread,
Mar 10, 2021, 1:00:22 AM3/10/21
to beast-users
hi aditi sharma i am trying tempest. it give me blue and red highlighted sequences. whether i have to remove only blue sequences or red one also? kindly guide me
Reply all
Reply to author
Forward
0 new messages