Low ESS for some parameters, does it matter?

Nicoletta Commins

Mar 26, 2021, 1:19:15 PM
to beast-users
I am trying to run BEAST to get a time tree for ~350 bacterial genomes with ~700,000 SNPs each. I'm trying separate runs with a constant population size tree prior and a Bayesian skyline tree prior. I have fixed the ucld mean rate to a previous estimate and fixed the starting tree topology to my ML tree. In both cases, I'm getting ESS >200 for some parameters, including the likelihood and treeLikelihood, but very low ESS for others. So far I have only run 20 million of 200 million states. I'm wondering a) if the ESS values are this low at this point in the run, should I just wait, or is it a bad sign for the run as a whole? and b) which parameters do I "need" to have a good ESS for in order to get a good sample of dated time trees? Any other suggestions would also be appreciated. I haven't tried changing any operator weights or tuning parameters yet.

[Attachments: Screen Shot 2021-03-26 at 10.16.17 AM.png, Screen Shot 2021-03-26 at 10.16.37 AM.png]

Pratanu Kayet

Mar 26, 2021, 1:40:14 PM
to beast...@googlegroups.com
What chain length did you use?

Nicoletta Commins

Mar 26, 2021, 1:54:02 PM
to beast-users
I set the chain length to 200 million, but it has only reached about 30 million so far, so it may improve if I just wait. But it's quite slow, so I'm wondering if it's wise to just wait or if there are already some red flags that I should try to fix.

Pratanu Kayet

Mar 26, 2021, 1:59:54 PM
to beast...@googlegroups.com
What kind of substitution model did you use, and what is your gamma parameter?
Can you also share a screenshot of the steps you followed in BEAUti?

Pratanu Kayet

Mar 26, 2021, 2:19:47 PM
to beast...@googlegroups.com
Low ESS matters. It basically shows how well your analysis has gone; an ESS < 100 is considered a poor analysis.

Lambodhar Damodaran

Mar 27, 2021, 1:19:39 PM
to beast-users
Hi Nicoletta,
Since it is early in the run, you should give it time. For a run of 200 million states, the first 20 million would usually be the standard burn-in, so I wouldn't evaluate it too heavily right now. You should definitely have at least two more runs; you can combine these runs, which will help with your ESS values. A poor ESS value means that the posterior distribution of that parameter has been poorly sampled. Parameters that still have poor ESS values at the end of the run need to be evaluated for proper priors and proper assumptions, and you can tweak the operators on the next run.
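
To make the burn-in and combining advice concrete, here is a minimal Python sketch. It is not how Tracer or LogCombiner compute things; it assumes the logged values of one parameter from each run are already available as a plain list of floats. The function names, the 10% burn-in default (20 million out of 200 million states), and the simple autocorrelation-based ESS estimator are all illustrative assumptions.

```python
def discard_burn_in(samples, fraction=0.1):
    """Drop the first `fraction` of logged samples (assumed burn-in)."""
    return samples[int(len(samples) * fraction):]


def combine_runs(runs, fraction=0.1):
    """Concatenate the post-burn-in samples of several independent runs."""
    combined = []
    for run in runs:
        combined.extend(discard_burn_in(run, fraction))
    return combined


def effective_sample_size(samples):
    """Rough ESS estimate: N / (1 + 2 * sum of positive autocorrelations).

    The sum over lags stops at the first non-positive autocorrelation,
    which is a simplification and not Tracer's exact rule.
    """
    n = len(samples)
    mean = sum(samples) / n
    centered = [x - mean for x in samples]
    variance = sum(c * c for c in centered) / n
    if variance == 0.0:
        return float(n)  # a constant trace has no autocorrelation to estimate
    rho_sum = 0.0
    for lag in range(1, n):
        autocov = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / n
        rho = autocov / variance
        if rho <= 0.0:
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)
```

For a strongly autocorrelated trace this estimate is far below the number of logged samples, which is one reason a parameter can show a low ESS early in a long run and still improve as more states are sampled.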

When you say "fixed" for the ucld.mean, what kind of distribution did you set for it? Also, are the genomes sampled closely in time, or is there a wide temporal range?
Don't know if this was helpful... but I think just give it more time and have multiple runs and then reevaluate.
Best,
Lambo

Tzu-Hao Kuo

Mar 29, 2021, 1:27:16 PM
to beast...@googlegroups.com
Hi Nicoletta,

The tree heights look similar to what I encountered previously. My solution was to add the counts of invariant (constant) sites to the XML file (please refer to https://groups.google.com/g/beast-users/c/QfBHMOqImFE).
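
As a rough illustration of what "counts of invariant sites" means, here is a minimal Python sketch that tallies the four counts (A, C, G, T). It assumes the SNP alignment was called against a single reference sequence and that every non-SNP position matches the reference in all samples. The function name and its inputs are hypothetical; how the counts are entered in the XML is covered in the linked thread, not here.

```python
from collections import Counter


def constant_site_counts(reference, variable_positions):
    """Count A, C, G, T at reference positions that are not SNP sites.

    `reference` is the reference sequence as a string, and
    `variable_positions` holds the 0-based positions of the SNP sites.
    Ambiguous bases such as N are skipped.
    """
    variable = set(variable_positions)
    counts = Counter(
        base
        for position, base in enumerate(reference.upper())
        if position not in variable and base in "ACGT"
    )
    return [counts[base] for base in "ACGT"]
```

The result is a list in the order A, C, G, T. Positions that were masked or uncalled in some samples would need to be excluded as well, which this sketch does not handle.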

Best regards,
Tzu-Hao

Nicoletta Commins

Mar 29, 2021, 1:27:16 PM
to beast-users
Hi Lambo,

This dataset has genomes sampled closely in time, so it does not show temporal signal. I'm using a separate clock rate estimate instead and fixing the mean clock rate to date the tree, rather than having BEAST estimate it. The distribution is log normal. I pasted this section of the XML file below (with some of my file names x'ed out).

I'm wondering if it makes sense that the chain will need to "solve" some values before it can effectively sample others, i.e., does it need to "solve" the transition/transversion rates and frequency parameters before it can mix well for other values like tree height?

Thanks!


    <branchRateModel id="RelaxedClock.c:20200731_mab_upid_droppedOutliers" spec="beast.evolution.branchratemodel.UCRelaxedClockModel" rateCategories="@rateCategories.c:xxxxxxx" tree="@Tree.t:xxxxxxxx">
        <LogNormal id="LogNormalDistributionModel.c:xxxxxxx" S="@ucldStdev.c:xxxxxxx" meanInRealSpace="true" name="distr">
            <parameter id="RealParameter.78" spec="parameter.RealParameter" estimate="false" lower="0.0" name="M" upper="1.0">1.0</parameter>
        </LogNormal>
        <parameter id="ucldMean.c:xxxxxxx" spec="parameter.RealParameter" estimate="false" name="clock.rate">1E-7</parameter>
    </branchRateModel>



Lambodhar Damodaran

Mar 29, 2021, 2:18:34 PM
to beast-users
Hi Nicoletta,
Gotcha. I don't know if it would help tremendously, but changing to a uniform distribution between 0 and 1, and using the mean clock rate you used as the initial rate, might make it a bit easier to converge on the true rate. As far as I understand, when it comes to mixing, the values associated with substitution models are typically very easy to solve: the alignment is static and the estimated parameters of that model are simple to estimate, so the MCMC converges on them very quickly. I think shifting the operator weights toward the parameters that are not converging as well, based on the recommendations in the .ops file once your run is done, will be helpful.

Tzu-Hao's suggestion to set invariant sites might also be a useful step if you know there are invariant sites in your alignment, but I would assume that because you're using 700K SNPs it probably doesn't have many. Out of curiosity, are you running these on a computing cluster or on your own machine?
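
On the invariant-sites point, a small arithmetic sketch in Python may show why it can matter when the clock rate is fixed. It assumes the 1E-7 rate was estimated per site per year over a whole genome, and it uses a hypothetical genome length of 5,000,000 sites purely for illustration; only the 700,000 SNP sites and the 1E-7 rate come from this thread.

```python
def expected_substitutions_per_year(rate_per_site, n_sites):
    """Expected substitutions per year summed over an alignment of n_sites."""
    return rate_per_site * n_sites


snp_sites = 700_000        # alignment length from the thread (SNPs only)
fixed_rate = 1e-7          # fixed clock rate from the XML above
genome_length = 5_000_000  # hypothetical value, not from the thread

per_year_snp_only = expected_substitutions_per_year(fixed_rate, snp_sites)
per_year_full = expected_substitutions_per_year(fixed_rate, genome_length)

# The same observed differences need this many times longer to accumulate
# when the per-site rate is applied to the SNP-only alignment.
inflation = per_year_full / per_year_snp_only
print(f"SNP-only: {per_year_snp_only:.3f} substitutions/year")
print(f"Full genome: {per_year_full:.3f} substitutions/year")
print(f"Tree heights inflated roughly {inflation:.1f}-fold")
```

Under these assumptions, leaving out the constant sites would stretch the tree heights by roughly the ratio of genome length to SNP count, which would fit the pattern Tzu-Hao described.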

I hope some of this jumbled mess of advice was helpful... my experience is more in virus phylodynamics, but I think a lot of the principles still apply. Good luck!
Lambo