Thanks for your email - sorry for the slow reply, I've been travelling. Answers to your very interesting questions below:
On Friday, September 14, 2012 12:12:00 AM UTC-4, Michael wrote:
Hi Rob,
I've been trying to run PartitionFinder on a big dataset of 1200 sp., 35000 bp (82% missing data), subdivided into 74 datablocks. I'm just trying to figure out at the moment whether PartitionFinder can handle a data set of that size in a reasonable amount of time.
That's a tough one. Ultimately, PartitionFinder is based on PhyML, which was not built to handle enormous datasets. It's certainly possible to get this done, but it might be very slow. We're working on another version which will get around this problem in various ways - specifically with datasets like yours in mind, but I can't guarantee that it will be ready any time very soon (sorry!).
I've been running it on a new MBP (16 GB memory, 8 CPUs), and I chose the greedy algorithm, linked branch lengths, the RAxML models, and the BIC model selection criterion. It took PartitionFinder about 40 hrs to build the BioNJ tree and estimate GTR+I+G branch lengths before it started analyzing the 5403 subsets. After 6 more hours, 26/5403 subsets had been analyzed.
Now, does PartitionFinder usually have to go through the maximum number (or almost the max number) of subsets, or might the analysis finish after analyzing a much smaller number of subsets?
The minimum number of subsets it will have to analyse is about 50% - this is because the first round of the greedy algorithm involves calculating the likelihood of N choose 2 subsets (all possible pairs of your N starting datablocks). Whether it finishes at 50% or nearer 100% depends entirely on your data, and on how finely you've divided it up into your initial subsets. With 82% missing data, my guess is that it will end up nearer the 100% mark than the 50% mark.
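To make the 50% figure concrete, here's the arithmetic, assuming N = 74 starting datablocks as in your run (the 5403 total is the number PartitionFinder reported for you; the snippet just checks how the first round relates to it):

```python
from math import comb

# Assuming 74 starting datablocks, as in the run described above.
n_blocks = 74
first_round = comb(n_blocks, 2)  # all pairs evaluated in the first greedy round
max_subsets = 5403               # total subsets PartitionFinder reported

print(first_round)                # 2701
print(first_round / max_subsets)  # ~0.4999, i.e. roughly 50% of the total
```

So even if the greedy search converges immediately after round one, roughly half of the 5403 subsets have already been analysed.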
Does the time needed per subset increase towards the end of the analysis, possibly because the largest subsets are the last ones to be analyzed?
Yes. In my experience this has not been an enormous increase though - perhaps 2-3 fold in the worst cases.
Once the 5403 subsets are analyzed, would PartitionFinder finish quickly or are there other time-consuming steps waiting after the subset analysis?
It should finish quickly after that.
Do you see any possibility to further speed up the analysis? I see that PartitionFinder uses only 2 out of the 8 CPUs available, presumably because I'm only testing the two RAxML models (GTR+G, GTR+I+G). Is there a way to run more than two threads at the same time even when testing only two models?
Yes. The best approach here would be to parallelise across multiple subsets at once. That's not in the current version - adding it would require some non-trivial (but entirely possible) modifications to the code. We're not planning to go down this route ourselves, because the other modifications we're making will parallelise everything at the site level, making parallelism at higher levels redundant. However, if you did want to modify the code you're obviously very welcome - it's all on GitHub.
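In case it helps anyone attempting that modification, here is a minimal sketch of what subset-level parallelism could look like. This is NOT actual PartitionFinder code - `analyse_subset` is a hypothetical stand-in for whatever function scores one subset (in reality it would invoke PhyML or RAxML), and the dummy return value is just for illustration:

```python
# Minimal sketch of subset-level parallelism; assumes a hypothetical
# analyse_subset() that scores one subset (really a PhyML/RAxML call).
from multiprocessing import Pool

def analyse_subset(subset_id):
    # Placeholder for the real per-subset likelihood calculation.
    return subset_id, subset_id * 2  # dummy (id, score) pair

if __name__ == "__main__":
    subsets = range(8)
    with Pool(processes=8) as pool:  # one worker per core
        results = pool.map(analyse_subset, subsets)
    print(results)
```

The subsets within one greedy round are independent of each other, which is what makes this kind of embarrassingly-parallel approach possible.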
Would be great if you could give me some feedback on this.
I guess your dataset is just about on the edge of what PF is able to achieve in its current form. Whether this is OK for you will depend on whether you are willing to wait for the run to finish. From the information you've given, it sounds like this would take somewhere in the region of 2 months. That's certainly faster than we'll be able to finish the next version.
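For what it's worth, that 2-month figure is just a back-of-envelope extrapolation from the numbers you gave (26 subsets in 6 hours, 5403 subsets in total), assuming a constant rate - the 2-3 fold slowdown I mentioned above for the later, larger subsets would push it higher:

```python
# Back-of-envelope estimate from the run described above:
# 26 subsets done in 6 hours, 5403 subsets in total.
subsets_done, hours_spent = 26, 6
total_subsets = 5403

rate = subsets_done / hours_spent  # ~4.3 subsets per hour
hours_left = (total_subsets - subsets_done) / rate
days_left = hours_left / 24

print(round(days_left))  # ~52 days at a constant rate, i.e. roughly 2 months
```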
In your case, however, my best advice would be to choose a sensible partitioning scheme for your data and just go with that. We showed in the original paper for PartitionFinder that if 'optimal' partitioning isn't possible, then partitioning by gene and codon position was always the next best bet. With a dataset of your size, I would just go for that, and run it in RAxML.
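Concretely, a gene-by-codon-position scheme for RAxML is just a partition file with three interleaved entries per protein-coding gene. The gene names and coordinates below are made up for illustration - substitute your own datablock boundaries:

```
DNA, gene1_codon1 = 1-300\3
DNA, gene1_codon2 = 2-300\3
DNA, gene1_codon3 = 3-300\3
DNA, gene2_codon1 = 301-900\3
DNA, gene2_codon2 = 302-900\3
DNA, gene2_codon3 = 303-900\3
```

You'd pass a file like this to RAxML with the -q option, with one such triplet per gene.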
Cheers,
Rob
Cheers,
Michael