How to pick runs with the highest maximum likelihood


Rafal Mostowy

Oct 20, 2014, 7:35:20 AM
to structure...@googlegroups.com
Hi,

I'm running STRUCTURE on a dataset for a range of values of K, with 20 replicate runs for each K. The problem is that many runs seem to get stuck in a local maximum, as there is enormous variance in the log-likelihoods between runs. My question is how to find the runs with the highest likelihood, as I want to restart them later using the option STARTATPOPINFO=1.

I'm attaching two examples that illustrate my confusion. First, I compared the 'Ln Like' chains, which I plotted in two figures (see attached PNG). As you can see, the chain in mcmc2.pdf reaches much higher values than the one in mcmc16.pdf, suggesting that run 2 has approached a much higher likelihood.

However, when I compared the output_f files, the results suggested the opposite.

The file output2_f has the following summary statistics:
--------------------------------------------
Estimated Ln Prob of Data   = -451648.6
Mean value of ln likelihood = -20698.1
Variance of ln likelihood   = 861901.0
Mean value of alpha         = 0.3619
Mean value of r              = 0.0009
Standard deviation of r    = 0.0203

while the file output16_f has the following summary statistics:
--------------------------------------------
Estimated Ln Prob of Data   = -283633.9
Mean value of ln likelihood = -18983.1
Variance of ln likelihood   = 529301.7
Mean value of alpha         = 0.2454
Mean value of r              = 0.0011
Standard deviation of r    = 0.0241

This, in turn, suggests that run 16 was the much better run.
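For what it's worth, the two summaries above are internally consistent: in both, the "Estimated Ln Prob of Data" equals the mean ln likelihood minus half its variance, which matches the estimator described in the original STRUCTURE paper (Pritchard et al. 2000). The sketch below just redoes that arithmetic with the numbers quoted above; the function and dictionary names are mine and are not part of STRUCTURE.

```python
# Values copied from the two output_f summaries quoted above.
runs = {
    "output2_f": {"est_lnpd": -451648.6, "mean_lnl": -20698.1, "var_lnl": 861901.0},
    "output16_f": {"est_lnpd": -283633.9, "mean_lnl": -18983.1, "var_lnl": 529301.7},
}


def lnpd_from_moments(mean_lnl, var_lnl):
    """Estimate Ln P(D) as the mean ln likelihood minus half its variance."""
    return mean_lnl - var_lnl / 2.0


for name, stats in runs.items():
    recomputed = lnpd_from_moments(stats["mean_lnl"], stats["var_lnl"])
    # Print the reported and recomputed values side by side for comparison.
    print(f"{name}: reported {stats['est_lnpd']:.2f}, recomputed {recomputed:.2f}")
```

Because the variances here are in the hundreds of thousands, the variance term dominates the estimate, so the Ln P(D) of a run can differ greatly from its mean ln likelihood.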

I'm running the analyses on a cluster, which prevents me from running longer MCMC chains (I'm already doing 800,000 + 300,000 iterations). Thus, I need to make sure that the runs converge on a global maximum before I analyse the results and estimate the best value of K.

Help would be highly appreciated!
R

2runs.png

Vikram Chhatre

Oct 20, 2014, 9:45:34 AM
to structure-software
Rafal -

What K are these two runs from? If you have done all 20 runs for every K you wanted to test, can you prepare an lnPD plot and show it to us as well?

V

--
You received this message because you are subscribed to the Google Groups "structure-software" group.
To unsubscribe from this group and stop receiving emails from it, send an email to structure-softw...@googlegroups.com.
To post to this group, send email to structure...@googlegroups.com.
Visit this group at http://groups.google.com/group/structure-software.
For more options, visit https://groups.google.com/d/optout.

Rafal Mostowy

Oct 20, 2014, 9:51:00 AM
to structure...@googlegroups.com
Hi Vikram,

I've analysed the full set of runs with STRUCTURE Harvester; the results should be accessible for the next few days:
http://taylor0.biology.ucla.edu/structureHarvester/completedJobs/sparkling-ridge-c769/summary.html

The problem I have with interpreting those results is that if some runs do not converge on a global maximum, they will yield much lower likelihood values, and hence give the impression that higher values of K do not explain the data well. Instead, it seems to me that I should first make sure that all runs have converged, and to do this I need to compare their likelihood values. Hence my original question.
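One way to make that comparison across all 20 runs, rather than file by file, is to pull the likelihood statistics out of each output file and sort on them. This is only a minimal sketch: it assumes each file contains summary lines formatted exactly like the ones quoted earlier in this thread, and the function names and the `results` directory are hypothetical. A high rank only shows which runs scored best, not that any of them converged.

```python
import re
from pathlib import Path

# Labels as they appear in the summaries quoted in this thread.
FIELDS = {
    "est_lnpd": "Estimated Ln Prob of Data",
    "mean_lnl": "Mean value of ln likelihood",
    "var_lnl": "Variance of ln likelihood",
}


def parse_summary(text):
    """Return the likelihood statistics found in one output file's text."""
    stats = {}
    for key, label in FIELDS.items():
        match = re.search(re.escape(label) + r"\s*=\s*(-?\d+(?:\.\d+)?)", text)
        if match:
            stats[key] = float(match.group(1))
    return stats


def rank_runs(texts, key="mean_lnl"):
    """Sort runs from highest to lowest value of the chosen statistic.

    `texts` maps a run name to the contents of its output file.
    Runs where the statistic could not be parsed are skipped.
    """
    parsed = [(name, parse_summary(text)) for name, text in texts.items()]
    parsed = [(name, stats) for name, stats in parsed if key in stats]
    return sorted(parsed, key=lambda item: item[1][key], reverse=True)


if __name__ == "__main__":
    # "results" is a hypothetical directory holding the output*_f files.
    texts = {p.name: p.read_text() for p in Path("results").glob("output*_f")}
    for name, stats in rank_runs(texts):
        print(name, stats)
```

Passing `key="est_lnpd"` ranks on the estimated Ln P(D) instead; for the two runs quoted above, both statistics put run 16 ahead of run 2.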

Thanks a lot,
R

Vikram Chhatre

Oct 20, 2014, 9:58:27 AM
to structure-software
Rafal -

Generally speaking, the variance tends to increase once an optimal K has been reached. This is what I think you are seeing with your data. Notice how the K=2 runs have very small variance compared to those at higher K.

What is the biological expectation of population structure for your data?

Rafal Mostowy

Oct 20, 2014, 10:08:18 AM
to structure...@googlegroups.com
Hi Vikram,

The biological expectation is that the sequences I'm analysing are highly mosaic due to recombination. I can see this using BAPS/BRAT, and I wanted to verify it using STRUCTURE. In fact, for the middle part of the alignment (which I analysed before and have now extended), STRUCTURE suggested that K=3 best describes the population structure, and I found that to be quite a reliable estimate given what I know about the dataset. It's strange that STRUCTURE now returns K=1 as the most likely value for the wider alignment.

In any case, I would still like to understand why the mean value of ln likelihood differs from the 'Ln Like' values for K>1. Do you know how the mean ln likelihood is obtained?