Re: PartitionFinder and clock model in BEAST


Rob Lanfear

Jan 1, 2013, 2:50:20 AM
to partiti...@googlegroups.com, Simon Ho
Hi Kai,

Interesting questions. I've CC'd Simon Ho on this, because he probably has the most sensible things to say about the first question.

On Monday, 31 December 2012 17:39:05 UTC+13, Kai He wrote:
Dear Rob and everyone, 

I have a couple of questions.

When analyzing a multi-locus data set using BEAST, one of the reviewers suggested that I give each gene an independent clock model. However, I wonder whether this is still compatible with the partitioning strategy found by PartitionFinder. PartitionFinder sometimes partitions the data set by codon position, e.g. the 1st and 2nd codon positions of genes A and B in partition 1, and the 3rd codon positions of genes A and B in partition 2.
So the question is: should I unlink the clock models to match the site models, or should I still unlink the clock models by gene?

There's no particular reason that clock models and site models need to match up. Any combination is possible, the problem is finding a good one (or the best one).

My pragmatic suggestion is to take the partitioning scheme from PartitionFinder, and then try out linked and unlinked clocks on that scheme. I would then compare linked and unlinked clocks using Bayes Factors (which you can do in Tracer). Simon Ho and I have a paper which does something similar (although this was before we wrote PartitionFinder) here: http://www.ncbi.nlm.nih.gov/pubmed/20795783
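For readers who want to see what the Bayes Factor comparison amounts to once the marginal likelihoods are in hand, here is a minimal Python sketch. The function names and the two log marginal likelihood values are hypothetical, made up for illustration; the verbal scale is the commonly used one from Kass and Raftery (1995).

```python
def two_ln_bayes_factor(log_ml_a, log_ml_b):
    """Return 2*ln(BF) in favour of model A over model B,
    given the log marginal likelihood of each model."""
    return 2.0 * (log_ml_a - log_ml_b)


def interpret(two_ln_bf):
    """Verbal strength of evidence (Kass and Raftery 1995 scale).
    The sign of two_ln_bf says which model is favoured."""
    strength = abs(two_ln_bf)
    if strength < 2:
        return "not worth more than a bare mention"
    if strength < 6:
        return "positive"
    if strength < 10:
        return "strong"
    return "very strong"


# Hypothetical log marginal likelihoods for the same partitioning scheme
log_ml_unlinked = -25298.1  # unlinked clocks
log_ml_linked = -25310.4    # one linked clock

score = two_ln_bayes_factor(log_ml_unlinked, log_ml_linked)
print(round(score, 1), interpret(score))
```

A positive score favours the first model passed in (here, the unlinked clocks), and a negative score favours the second.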

However, there's something a bit more fundamental here - PartitionFinder is written in a likelihood framework, and BEAST in a Bayesian framework. The two things are quite different, and although a partitioning scheme from PartitionFinder should do a perfectly decent job when used in BEAST, it would be more appropriate to do everything in a Bayesian framework. In part, this is because a truly Bayesian approach to partitioning would be to integrate across all possible partitioning schemes and clock models, but it's also true that the 'best' partitioning scheme in a Bayesian framework might differ from that in a likelihood framework. Until recently, there was no software available to do truly Bayesian partitioning, but this just changed - it's now possible to do it, and it's described in this paper: http://t.co/Nau6rAQM. A note though, I haven't tried out the implementation of the approach, and I don't know how user-friendly it is. That would be one for the BEAST forums though.

Given that your paper is already in review, and presumably that new BEAST paper didn't exist when you started out, I think it would be reasonable to just compare linked and unlinked clock models (i.e. following the partitioning scheme you already have) with Bayes Factors, and ignore the more rigorous approach here. I just thought I'd post about it in case you or others were interested.
 

Besides, I wonder whether anybody is testing, or attempting to test, different partitioning strategies using PS/SS etc.?


Sounds interesting! What are PS and SS?

Cheers,

Rob

 
Happy new year!

Kai He


Rob Lanfear

Jan 1, 2013, 3:17:14 AM
to PartitionFinder, Simon Ho
Hi Kai,

Just a quick update. I just noticed your attached image, and saw that you have nuclear and mitochondrial genes in there. That suggests a few options for unlinked clocks: you should perhaps compare (using Bayes Factors) some approaches which unlink nuclear and mitochondrial clocks, but link some or all of the clocks within each of those categories.

Cheers,

Rob 
--
Rob Lanfear
Research Fellow,
Ecology, Evolution, and Genetics,
Research School of Biology,
Australian National University

phone: +61 (0)2 6125 3611

www.robertlanfear.com

Simon Ho

Jan 1, 2013, 7:09:14 AM
to Kai He, Rob Lanfear, PartitionFinder
Hi Kai,

There isn't a good method for selecting the clock scheme - there are too many possible combinations to test exhaustively using Bayes factors. My students and I are working on a method that chooses the best way to assign clock models for multiple loci, but we are still ironing out a few problems. It should probably be available in a month or two. 
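To put a number on "too many possible combinations": if each locus can share a clock with any other, the number of ways to group n loci into clock partitions is the Bell number B(n). The short Python sketch below computes it with the Bell triangle. It counts groupings only, and assumes every group uses the same kind of clock model; allowing a choice of clock model per group would make the count larger still.

```python
def bell_number(n):
    """Number of ways to split n loci into non-empty clock groups,
    computed with the Bell triangle."""
    row = [1]
    for _ in range(n - 1):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the neighbour from the previous row.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]


for n_loci in (3, 5, 10):
    print(n_loci, "loci:", bell_number(n_loci), "possible clock schemes")
```

Even 10 loci give more than a hundred thousand schemes, each of which would need its own marginal likelihood estimate.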

From our preliminary analyses of real data sets, we've found that the best first step is usually to assign one relaxed clock to the mitochondrial genes and one relaxed clock to the nuclear genes. You should probably do this for your analysis. You could also try adding further clock models (e.g., one for each nuclear gene), just to see if it makes much difference to the estimates of the tree topology and node times. 

 It's generally not a good idea to have a separate relaxed clock for each partition when there is a large number of partitions, because this massively increases the number of parameters (for minimal gain). 

Cheers,
Simon

ASSOC PROF SIMON HO
ARC QEII Research Fellow
School of Biological Sciences

THE UNIVERSITY OF SYDNEY
Edgeworth David Building A11
Sydney, NSW 2006, Australia
E simo...@sydney.edu.au

Kai He

Jan 2, 2013, 4:11:26 PM
to partiti...@googlegroups.com, Simon Ho
Hi Rob and Simon,

Thank you very much for your reply and suggestions.

I was worried that there could be strange interactions when using unmatched clock models and site models. Since it is fine, I will compare different schemes using Bayes factors.

I have sent Chieh-Hsi Wu (first author of the paper "Bayesian selection of nucleotide substitution models and their site assignments") an email. She told me that the XML function for the method is not yet available in BEAUti, but is already in progress.

Path sampling (PS) and stepping-stone sampling (SS) are described in the following papers and have been shown to outperform the harmonic mean estimator (HME).

Accurate Model Selection of Relaxed Molecular Clocks in Bayesian Phylogenetics
Choosing among Partition Models in Bayesian Phylogenetics
Improving the accuracy of demographic and molecular clock model comparison while accommodating phylogenetic uncertainty.
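
As a rough illustration of how the stepping-stone estimator differs from the harmonic mean estimator, here is a minimal Python sketch. It assumes you already have log-likelihood samples from a series of power posteriors (the likelihood raised to a power beta between 0 and 1). The function names and the input layout are my own assumptions for illustration, not BEAST's interface.

```python
import math


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def stepping_stone_log_ml(betas, loglik_samples):
    """Stepping-stone estimate of the log marginal likelihood.

    betas: ascending powers from 0.0 (prior) to 1.0 (posterior).
    loglik_samples[k]: log-likelihoods sampled from the power posterior
    at betas[k], for every beta except the last one.
    """
    if len(loglik_samples) != len(betas) - 1:
        raise ValueError("need one sample list per beta except the last")
    total = 0.0
    for k in range(1, len(betas)):
        step = betas[k] - betas[k - 1]
        # Ratio of adjacent normalising constants, estimated from the
        # samples drawn at the lower of the two powers.
        total += log_mean_exp([step * ll for ll in loglik_samples[k - 1]])
    return total


def harmonic_mean_log_ml(posterior_logliks):
    """Harmonic mean estimate, using posterior samples only."""
    return -log_mean_exp([-ll for ll in posterior_logliks])
```

The harmonic mean uses only samples from the posterior, so a handful of low-likelihood samples can dominate it, whereas the stepping-stone estimate is built up from many small steps between the prior and the posterior.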




Best, 
Kai He




Brian Muchmore

Dec 7, 2013, 4:24:34 AM
to partiti...@googlegroups.com, Simon Ho
I thought I would revive this discussion now that the molecular clock partitioning program ClockstaR has been released from Simon's group.  I have been in contact with the first author concerning a few questions I had, one of which is pasted below.  I certainly don't want to make it seem like I don't trust his advice - nor am I trying to put his advice on the spot - but I am curious what the other experts on the board think as far as having PartitionFinder and ClockstaR "work" together for the best possible input into a program like BEAST.

Me:  When running PartitionFinder, it gives the option of running the analysis with either linked or unlinked branch lengths.  Presumably, the best thing to do is to run PartitionFinder once with each option, and choose the best scheme according to the BIC score.  My question is how do you see ClockstaR "working" with PartitionFinder?  For example, what do you think if PartitionFinder tells me that the best substitution partitioning scheme is with linked branches while ClockstaR tells me that each partition I feed it should have its own molecular clock?

Him: I have not compared ClockstaR with the option of unlinking branch lengths in PartitionFinder. I think the best approach is to run ClockstaR first, assuming a different substitution model for each gene or data subset. Then you can run PartitionFinder on each clock partition. So you would have the clock partition, and within each clock partition some substitution model partitioning scheme selected in PartitionFinder. I think this is the best approach to use both programs. You should probably link the branch length estimation in PartitionFinder for each clock partition, which is also quite conservative because you are reducing the number of parameters.
One thing to keep in mind is that the approaches of the two programs are very different. PartitionFinder uses a greedy algorithm, so it could get stuck in local optima. In ClockstaR there is no real optimization beyond that of the clustering algorithm and finding the tree distance.
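
For the BIC comparison mentioned in the question above (linked versus unlinked branch lengths), the arithmetic is simple enough to sketch in Python. All numbers below are hypothetical, and taking the sample size to be the number of alignment sites is an assumption of this sketch.

```python
import math


def bic(log_likelihood, n_params, n_sites):
    """Bayesian Information Criterion: -2 lnL + k ln(n). Lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_sites)


# Hypothetical results for the best scheme under each branch-length setting.
# Unlinking branch lengths adds a full set of branch lengths per subset,
# so the parameter count rises sharply.
bic_linked = bic(log_likelihood=-25400.0, n_params=120, n_sites=5000)
bic_unlinked = bic(log_likelihood=-25350.0, n_params=480, n_sites=5000)

best = "linked" if bic_linked < bic_unlinked else "unlinked"
print("preferred branch-length setting:", best)
```

In this made-up case the small gain in likelihood from unlinking does not pay for the extra parameters, so the linked setting has the lower BIC.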

Rob Lanfear

Dec 7, 2013, 5:29:41 AM
to PartitionFinder, Simon Ho
Hi Brian,

This is a good question. I'll have to have a careful read of the ClockstaR paper before I get back to you. But in general it is going to be somewhat ad-hoc whatever you do. So I would suggest doing whatever you think sensible for your dataset, and if you're unsure which is the best approach, just try all the approaches that seem sensible and compare them using Bayes Factors. Bayes Factors have a bit of a bad press, but I think that in this case (i.e. when you have a few sensible approaches to model selection, and no totally Bayesian way to integrate over your uncertainty in those models), Bayes Factors are useful. 

However, I'll reserve proper judgement until I have read the ClockstaR paper again. In the meantime, maybe Simon has something more useful to say.

Cheers,

Rob




Simon Ho

Dec 7, 2013, 11:56:25 PM
to Rob Lanfear, PartitionFinder
Hi Brian,

Sebastian and I haven't thought about this in detail yet, but any approach to combining the substitution- and clock-model partitioning schemes will probably be ad hoc (as Rob has pointed out). I think that the approach suggested by Sebastian sounds reasonable. He is travelling at the moment, but I'll discuss this with him when he returns. 

We are also about to submit a paper that examines the performance of ClockstaR in more detail. Specifically, we compare the suggested partitioning schemes using Bayes factors, and examine their effects on the resulting estimates of rates and dates. 

Cheers,
Simon




Brian Muchmore

Dec 8, 2013, 7:26:00 AM
to partiti...@googlegroups.com, Rob Lanfear
Thanks for the responses, guys,

Rob, when you have had a chance to read the paper more carefully, I would definitely be curious to know what you think.  Also, do you have a citation for the recent criticism of Bayes factors?  I think I have run across this before while trolling Bayesian phylogenetic reading materials, but I would be curious to revisit this.

Right now, it sounds to me like I should implement Sebastian's suggestion (ClockstaR -> PartitionFinder), the opposite of his suggestion (PartitionFinder -> ClockstaR), and both run independently of each other (PartitionFinder - ClockstaR), and then compare them with Bayes Factors.  But I would love to hear further opinions.

This is such a tricky field of work, so it is great to get input from the people who write these programs.  God bless Google Groups.

Rob Lanfear

Dec 10, 2013, 6:30:36 PM
to Brian Muchmore, PartitionFinder
Hi All,

I had a good read of the ClockstaR paper. The approach seems pretty sensible, but there are various reasons why it won't mesh well with the current version of PF.

The biggest issue in combining PF and ClockstaR is that ClockstaR requires you to know your partitions first. Also, from the results in the paper it seems like one tends to estimate far fewer local clocks than model partitions. So, my suggestion is as follows:

1. Define data blocks as usual (e.g. by gene and codon position)
2. Run PartitionFinder to get a partitioning scheme and select the best model for each scheme (use linked branch lengths - this is as close as you can currently get to having independent clocks for each partition)
3. Input those results to ClockstaR
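
Step 1 above can be sketched in Python: the helper below writes out PartitionFinder-style data block lines that split each gene by codon position. The gene names and coordinates are hypothetical, and the sketch assumes every gene is protein-coding and starts at its first codon position.

```python
def codon_data_blocks(genes):
    """Build PartitionFinder data block lines, one per codon position.

    genes: list of (name, start, end) tuples with 1-based, inclusive
    alignment coordinates. Each gene is assumed to start in frame.
    """
    lines = []
    for name, start, end in genes:
        for offset in range(3):
            # "\3" means "every third site" in PartitionFinder's syntax
            lines.append(f"{name}_pos{offset + 1} = {start + offset}-{end}\\3;")
    return lines


# Hypothetical alignment: one mitochondrial and one nuclear gene
for line in codon_data_blocks([("cytb", 1, 1140), ("rag1", 1141, 2250)]):
    print(line)
```

The printed lines are meant to be pasted under the [data_blocks] section of the PartitionFinder configuration file.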

I will add one condition to this - if in step 3 you see that you have a number of local clocks that is close to the number of partitions, then it's probably worth trying the analysis the other way around (i.e. ClockstaR first, then PartitionFinder). My intuition suggests that this is unlikely, but I could be wrong.

I'd be interested to know what happens when you do this. If anyone feels like reporting results back on this, that would be really helpful to all of us who develop the methods.

Cheers,

Rob

Simon Ho

Dec 12, 2013, 6:50:52 AM
to <partitionfinder@googlegroups.com>, Brian Muchmore
Hi Rob and all,

As Rob has pointed out, the number of clock partitions selected by ClockstaR is typically smaller than the number of substitution-model partitions selected by PartitionFinder. 

In addition, I find it easier to envisage data partitions with different substitution models sharing the same clock model, rather than data partitions with different clock models sharing the same substitution model.

For these reasons, I think that PartitionFinder -> ClockstaR is probably the better approach. 

Cheers,
Simon



Brian Muchmore

Dec 13, 2013, 7:27:09 AM
to partiti...@googlegroups.com, Brian Muchmore
This is part of the same discussion, so I will mention it here even though it is really about ClockstaR:

One current difficulty with ClockstaR is that it will throw an error during model testing if your data partitions don't have enough variation, and it is currently impossible to manually set the substitution model for a partition.  There is a workaround for this (pasted below, courtesy of Sebastian), but it would be great if one could manually set the partition substitution schemes, so that the PartitionFinder --> ClockstaR workflow was a more seamless transition.

Sebastian:
One way to get around this is to load a list of trees whose branch lengths have been optimized in another program with the corresponding substitution model. Then you get ClockstaR to estimate the distance between the trees, and find the best partitioning scheme. You can do this with the function get.all.groups(), which takes a list of trees as the argument. Then you can use the output of get.all.groups() as an argument to the function diagnostics.clockstar() to get the standard results. You can see how this works with the pdf manual and the examples for the functions. However, I will also upload a tutorial of how to do this soon.