Bayes factors and differences in multiple runs of the same model

Robert Kraus

Aug 31, 2012, 4:49:40 AM
to migrate...@googlegroups.com
Hi!

I am working on a microsatellite data set (3 pops: 42, 106, and 40 individuals; 14 msats; a medium-sized mammal). I'm having difficulties getting good estimates for my parameters, no matter what I do or how long I run the programme (within reasonable time, say, a few days on an MPI machine with 15 cores, to use one thread per locus plus a master). I'm trying to find out myself what the problem is, so I have not reported it to the group yet. I will do that later, but during my investigations I started to perform runs with the same data and settings multiple times (= full replicate runs).

I thought that among the possible reasons why migrate-n performs badly on my data and settings may be that my model is completely unrealistic. So I decided to be fancy and do model testing, but now have an interesting question: the marginal likelihood scores needed for Bayes factors are not identical between my replicate runs. I have specified five models, each with 10 replicate runs. Do I average scores over the runs of the same model? Is there a smart way to incorporate something like the standard deviation/error of such an average?

Cheers,
robert

Peter Beerli

Sep 4, 2012, 2:58:09 PM
to migrate...@googlegroups.com
Robert,

could it be that your non-convergence depends on the uneven numbers of individuals? Of course, several other things may be problematic too; for example, if the mutation rates are very different among your loci it may be difficult to get unimodal peaks. The option mutation=DATA may help in such cases, as it 'tries' to scale out the mutation-rate effect.

Models that are far from the truth are often difficult to bring to convergence, and there may even be real scenarios that are difficult too. migrate certainly has difficulties switching between scenarios in which the data are decisive for A<--B and decisive for A-->B, but I have not yet seen a real dataset with such patterns.

I am not clear on how you replicate. I often replicate within a run (say, 10 or 100 times); this allows combining over different MCMC trajectories and helps against stuck MC chains. I sometimes run several runs with the same settings to get an idea of how variable the marginal likelihoods are. I would be inclined to compare the runs that have the highest marginal likelihood rather than averages over replicates (of course, this also assumes that you sampled from converged runs).
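Peter's suggestion (pick the best replicate run per model, then compare) can be sketched in a few lines; the model names and log marginal likelihood values below are invented for illustration only:

```python
import math

# Invented log marginal likelihoods from several replicate runs per model.
log_ml = {
    "full_migration": [-4802.1, -4799.8, -4801.3],
    "one_direction":  [-4815.6, -4812.9, -4814.0],
}

# Per Peter's advice: take the best (highest) replicate for each model,
# rather than averaging over replicates.
best = {m: max(vals) for m, vals in log_ml.items()}

# The log Bayes factor is the difference of log marginal likelihoods;
# model probabilities follow by normalizing (shift by the max for stability).
lbf = best["full_migration"] - best["one_direction"]
top = max(best.values())
denom = sum(math.exp(v - top) for v in best.values())
prob = {m: math.exp(v - top) / denom for m, v in best.items()}
print(lbf, prob)
```

This also illustrates why a large spread among replicates matters: the Bayes factor is only as trustworthy as the runs' convergence.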

Peter

Robert Kraus

Sep 7, 2012, 2:08:18 PM
to migrate...@googlegroups.com
Hi Peter,

I wasn't aware that an uneven number of individuals could lead to non-convergence. Is this often observed? I had the impression that, at least when you also have a decent sample size for the smaller pops, it doesn't hurt to include more individuals when one has them.

I have no idea about the mutation rates. I should try mutation=DATA (and will report whether it helped).

I did full replicate runs, incl. burn-in etc., all with the same settings, run from different directories. So combining the runs is not possible, is it? I would need to check how to set migrate-n to replicate within a run anyway. I have no idea right now how this works, but it sounds great. Does it also benefit from MPI? Because I have a big cluster at hand. I thought this long-chain/short-chain setting is for maximum likelihood. Maybe you could quickly point out how to set up a replicate run that benefits from MPI? If it's too much work, please don't spend time on it; I should find out some time. I'm just in a rush right now...

Cheers,
robert


-------- Original Message --------
> Date: Tue, 4 Sep 2012 14:58:09 -0400
> From: Peter Beerli <beerli...@gmail.com>
> To: migrate...@googlegroups.com
> Subject: Re: [migrate-support] Bayes factors and differences in multiple runs of the same model

Peter Beerli

Sep 7, 2012, 5:32:57 PM
to migrate...@googlegroups.com, Robe...@gmx.li
Robert,

we always read about convergence, but most users misapply the term. You should sample from some equilibrium, so the chain has to have converged; results will be crummy if convergence is reached only at the end of the run. Preferably you want to reach it during the burn-in.

Uneven numbers may not converge quickly. "Non-convergence" is an unhelpful term and should not really be used, because we should distinguish between "not converging however long we run" and "not converged in x steps"; the first is difficult to prove. Well, I bet that almost all uneven sample sizes converge eventually, but consider an extreme thought example: two populations, one with 10000 samples and one with 2. Let's further assume that the migration rate is almost zero, so that the two populations share a single migration event. If we start with a random tree, change one branch at a time, and have 19999+2 branches to choose from, then almost all tree changes will happen in the first population and only a few in the second; the population size estimate for pop 2 will be terrible. If you run long enough this will work out fine, but in this case results would be much easier to get with a random 10 out of the 10000, plus the 2. Differences such as 20 and 10 should pose little problem in terms of convergence, but I guess that 100 and 10 will simply need longer runs.

I guess that mutation=DATA will help.

For replication within a run use the
replication=YES:10
option (10 is an example; any number > 1 is fine).
Replication DOES benefit from MPI. Replication has nothing to do with the short/long chains of ML, where replication can be set either as complete replicates (like above) or as combining over the last (long) chain (I think that option is not as good as the full replicates).
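Putting the two options discussed in this thread together, the relevant parmfile lines might look like the following sketch (only replication=YES:10 and mutation=DATA come from this thread; every other setting in your parmfile stays as you had it):

```
# fragment of a migrate-n parmfile (sketch; all other options unchanged)
replication=YES:10    # 10 within-run replicates, combined per locus
mutation=DATA         # scale mutation-rate differences among loci from the data
```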

I will expand my tutorial page on hardware issues (which I never finished) and give a few explanations.
In short: if you have a cluster with 100 nodes, a dataset with 10 loci, and you can use the full machine, I suggest running 10 loci with 100 replicates. Use a long burn-in, such as 100000 steps or larger, and sample relatively little (say 5000); the long burn-in is needed to achieve convergence so that you then sample from a converged chain.
[My largest trials with migrate were 10000 loci with 1 replicate, 1000 loci with 10 replicates, and 10 loci with 1000 replicates on our cluster; the largest number of nodes was 512 (runs on 1024 nodes failed, but I was not able to troubleshoot that because of hardware problems). I am currently tracing an issue with MPI and migrate on Windows that leads to crashes I cannot reproduce on a Mac. So for runs with many replicates and loci on large clusters, try a small (and short) example first to see whether migrate can finish, and then scale up, because when it crashes on these machines it is very difficult to find out what is going on.]
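The process count for such a run works out to one master plus loci × replicates workers. A sketch of the sizing arithmetic (the binary name migrate-n-mpi and the exact mpirun invocation are assumptions; check your local installation and scheduler):

```shell
# hypothetical sizing for the example above: 10 loci x 100 replicates
LOCI=10
REPLICATES=100
NP=$((LOCI * REPLICATES + 1))   # one worker per locus/replicate pair, plus one master
echo "need $NP MPI processes"
# the actual invocation would then look something like:
echo "mpirun -np $NP migrate-n-mpi parmfile"
```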

Peter

Peter Beerli

Sep 20, 2012, 9:56:36 PM
to migrate...@googlegroups.com
Eric,

1000 log units sounds large (even for microsatellites and short runs).
Replicates are combined per locus and are not reported independently; they are correctly summed (averages would be wrong).

- No, you cannot simply combine the single-locus marginal likelihoods; the overall estimate is the sum over all loci multiplied by a scaling factor (for details see the appendix in Beerli and Palczewski 2010).
- No, you cannot calculate the averages from the bayesallfile for this.
- I assume that your runs or your datafile are incorrect, because reporting a log mL with 50 digits sounds like the rescue -HUGE value. Send me the outfile.pdf and I can check.
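Eric's guess of adding the per-locus log marginal likelihoods (which multiplies the likelihoods themselves) can be sketched as below. Note Peter's caveat: migrate's own overall value additionally involves a scaling factor (Beerli and Palczewski 2010), so this naive sum is only an approximation, and the per-locus numbers here are invented:

```python
# Invented per-locus log marginal likelihoods for two competing models
# (6 loci, as in Eric's microsatellite dataset).
model_a = [-812.4, -790.1, -805.7, -799.3, -820.0, -801.2]
model_b = [-815.0, -793.2, -806.1, -804.8, -822.5, -803.9]

# Summing log marginal likelihoods over loci multiplies the likelihoods.
lnml_a = sum(model_a)
lnml_b = sum(model_b)

# Log Bayes factor of model A over model B: positive favors A.
lbf = lnml_a - lnml_b
print(lnml_a, lnml_b, lbf)
```

A per-locus difference of ~1000 log units between replicates, as Eric reports, would completely swamp a Bayes factor of this magnitude, which is why the replicates need to converge first.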

Peter




On Sep 20, 2012, at 9:41 PM, Eric <eric.d...@gmail.com> wrote:

Regarding this, I am running a 6-microsatellite dataset on a marine mollusk with 4 populations. I'm doing 5 replicate runs (replication=YES:5) and attempting some model selection with the output. Thermodynamic marginal likelihoods for individual loci are somewhat similar among replicates but do not converge; usually they differ by up to 1000 units of log likelihood.

My first set of questions is: what are the numbers that are given in the PDF outfile for these marginal likelihoods? Are the values that are given for each locus averaged over replicates? Is it better to use these values, or to take the mean of the ML_thermo parameter in the Bayesallfile?

Second question: The outfile gives some ridiculously long number (~50 digits) for the marginal likelihood over the whole dataset. I figure there is a bug here, but how can I summarize over the whole dataset? My guess is just to add the LnL values for each locus (thus multiplying the likelihoods themselves). Is that correct?

I will try mutation=DATA. Do you have any other suggestions that might help the replicate runs converge?

Many thanks for your help,

Eric

Robert Kraus

Sep 27, 2012, 2:37:16 AM
to migrate...@googlegroups.com
Hi Peter,

after some playing around I'm still not getting consistent results. I'll send the infiles, parmfiles, and outfiles to your FSU address. Maybe I'm doing something really stupid and simply don't get it.

Basically, what I tried:

- mutation=DATA
- harmonising sample sizes by randomly excluding individuals from the two larger pops until all pops were of same samples size
- I found one additional problem: I realised that in the input file the allele lengths of the msats were entered such that the smallest allele was coded 01, the next 02, and so on. This is a problem, right? Because the mutation model cannot let the alleles mutate below one repeat... I should have picked that up earlier, but alas... I tried long runs with an input file in which I added 10 to each allele size, so 01 became 11, etc. Is this valid? Anyway, it didn't help either.
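Robert's recoding (shifting every allele code by +10 so the stepwise mutation model has room to mutate downward from the smallest observed allele) could be scripted like this; the input here is a plain list of integer allele codes, not migrate's actual infile format:

```python
# Shift microsatellite allele codes upward so the smallest allele does not
# sit at the boundary of the stepwise mutation model's state space.
SHIFT = 10

def recode(alleles):
    """Add a constant to every integer allele code; '?' marks missing data."""
    return [a + SHIFT if isinstance(a, int) else a for a in alleles]

genotypes = [1, 2, 2, 5, "?", 3]
print(recode(genotypes))   # [11, 12, 12, 15, '?', 13]
```

Because the stepwise model only cares about differences in repeat number, a constant shift leaves all pairwise allele distances unchanged, which is why the recoding is valid.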

A last remark on replication: I haven't tried this yet. If I understand correctly, replication is not done during burn-in? That is, when I have 10 loci and would normally use 11 nodes, I could set replication to 10 so that AFTER burn-in 100 nodes can work at a time (plus the one master node)? But if I set the run to start on 101 nodes, will 90 be idle during burn-in? Just for my understanding.

This data set has been holding me up longer than I wished... But maybe I am just spoiled by the other sets I have had so far, and the problem is simply in the data. We know that genotyping errors are present. I can only set an error rate for sequences, right? Otherwise I would have entered an error rate for my msat genotypes already.

Cheers,
Robert

-------- Original Message --------
> Date: Fri, 7 Sep 2012 17:32:57 -0400
> CC: Robe...@gmx.li