Re: PartitionFinder with big dataset

Rob Lanfear

Sep 14, 2012, 10:51:03 AM
to partiti...@googlegroups.com
Hi Michael,

Thanks for your email - sorry for the slow reply, I've been travelling. Answers to your very interesting questions below:

On Friday, September 14, 2012 12:12:00 AM UTC-4, Michael wrote:
Hi Rob,
I've been trying to run PartitionFinder on a big dataset of 1200 sp., 35000 bp (82% missing data), subdivided into 74 datablocks. I'm just trying to figure out at the moment whether PartitionFinder can handle a data set of that size in a reasonable amount of time.

That's a tough one. Ultimately, PartitionFinder is based on PhyML, which was not built to handle enormous datasets. It's certainly possible to get this done, but it might be very slow. We're working on another version which will get around this problem in various ways - specifically with datasets like yours in mind, but I can't guarantee that it will be ready any time very soon (sorry!).
 
I've been running it on a new MBP (16 GB memory, 8 cpus), and I chose the greedy algorithm, linked branch lengths, the raxml models, and the BIC model selection criterion. It took PartitionFinder about 40 hrs to build the BioNJ tree and estimate GTR+I+G branch lengths, before it started analyzing the 5403 subsets. After 6 more hours, 26/5403 subsets had been analyzed.
Now, does PartitionFinder usually have to go through the maximum number (or almost the max number) of subsets, or might the analysis finish after analyzing a much smaller number of subsets?

The minimum number of subsets it will have to analyse is about 50% of that maximum - this is because the first round of the algorithm involves calculating the likelihood of all N choose 2 pairwise merges of your N starting subsets. Whether it finishes at 50% or nearer 100% depends entirely on your data, and on how finely you've divided it up in your initial subsets. With 82% missing data my guess would be that it would be nearer the 100% mark than the 50% mark.
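For concreteness, the subset counts quoted in this thread can be reproduced with a quick back-of-the-envelope sketch (plain Python; this just counts the merges the greedy algorithm can score, it is not taken from PartitionFinder's code):

```python
from math import comb

n = 74  # number of initial datablocks in Michael's dataset

# Round 1: score every starting subset plus every possible pairwise merge.
first_round = n + comb(n, 2)             # 74 + 2701 = 2775

# Each later greedy step creates one new merged subset, which then has to
# be scored against each remaining subset: (n-2) + (n-3) + ... + 1 merges.
later_rounds = comb(n - 1, 2)            # 2628

worst_case = first_round + later_rounds  # 5403, the number quoted above
print(first_round, worst_case)
```

The first round alone is 2775 of the 5403 possible subsets, which is where the "about 50%" floor comes from.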
 
Does the time needed per subset increase towards the end of the analysis, possibly because the largest subsets are the last ones to be analyzed?

Yes. In my experience this has not been an enormous increase though - perhaps 2-3 fold in the worst cases.
 
Once the 5403 subsets are analyzed, would PartitionFinder finish quickly or are there other time-consuming steps waiting after the subset analysis?

It should finish quickly after that.
 
Do you see any way to speed up the analysis further? I see that PartitionFinder uses only 2 out of the 8 cpus available, presumably because I'm only testing the two raxml models (GTR+G, GTR+I+G). Is there a way to run more than two threads at the same time even when testing only two models?

Yes. The best approach here would be to parallelise multiple subsets at once. That's not in the current version - adding it would require some non-trivial (but very much possible) modifications to the code. We're not planning on going down this route, because the other modifications we're making will parallelise everything at the site level, making parallelisation at higher levels redundant. However, if you did want to modify the code you're obviously very welcome - it is all on GitHub.
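To illustrate what subset-level parallelism could look like (this is not PartitionFinder's code; `analyse_subset` is a hypothetical stand-in for whatever function shells out to PhyML for one subset), a thread pool is enough here because the heavy lifting happens in an external process:

```python
from concurrent.futures import ThreadPoolExecutor

def analyse_subset(subset_id):
    # Hypothetical stand-in: in a real run this would launch PhyML on one
    # subset's alignment (an external process) and parse the log-likelihood.
    return subset_id, -1000.0 - subset_id

def analyse_all(subset_ids, workers=8):
    # Independent subsets can be scored concurrently; threads suffice
    # because each worker mostly waits on an external PhyML process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(analyse_subset, subset_ids))
```

Wrapped around the subset loop, something like this could keep all 8 cores busy regardless of how many models are being tested.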

 
Would be great if you could give me some feedback on this.

I guess your dataset is just about on the edge of what PF is able to achieve in its current form. Whether this is OK for you will depend on whether you are willing to wait for the run to finish. From the information you've given, it sounds like this would take somewhere in the region of 2 months. That's certainly faster than we'll be able to finish the next version.

In your case, however, my best advice would be to choose a sensible partitioning scheme for your data and just go with that. We showed in the original paper for PartitionFinder that if 'optimal' partitioning isn't possible, then partitioning by gene and codon position was always the next best bet. With a dataset of your size, I would just go for that, and run it in RAxML.

Cheers,

Rob
 
Cheers,
Michael

Rob Lanfear

Sep 16, 2012, 12:01:22 PM
to partiti...@googlegroups.com
Hi Michael,

That sounds like a neat hack. Right now we don't order subsets when we analyse them, so the order in the .cfg file won't have any large effect. The easiest thing would be to copy the entire analysis folder, start a separate run, then about once a day you could consolidate both runs into a single file and start again.

Not perfect, but it should give you some significant speed-ups...

Cheers,

Rob 


On Saturday, September 15, 2012 11:56:59 PM UTC-4, Michael wrote:
Hi Rob,
thanks for your detailed and still very fast reply. Great to hear that there will be faster versions coming out eventually. In the meantime, I'm afraid I won't be able to parallelise the current code myself - I just don't feel experienced enough with Python and multi-threading.

From what I read in this forum, though, I was wondering if there was an even easier solution. Apparently, subsets that have been analysed previously don't need to be reanalysed if they're in the right format and folder (as described in the post 'Reanalyzing "all" after "greedy"'). So if I started PF twice at the same time (to use 4 cpus instead of 2), and if I could get it to start analysing subsets from different ends of the list, I could probably stop both runs after a while, throw together the analysed subsets from the two different 'analysis/subsets/' folders, and then restart a single run that reuses all the previously analysed subsets. To do this, I could probably get PF to start on different ends of the subset list either with minor code modifications or by simply producing two versions of the cfg file, one of which has the subset list in reversed order. I'll play around with this...
Cheers, Michael

Rob Lanfear

Sep 30, 2012, 9:48:14 PM
to partiti...@googlegroups.com
Hi Michael,

That's awesome! Thanks for putting in all that work and letting others benefit. 

We're working on ways to improve the parallelisation, so this is super helpful.

Cheers,

Rob

On Sunday, September 30, 2012 9:34:09 PM UTC-4, Michael wrote:
Hi Rob,

the hack worked! PartitionFinder is done with my dataset after about two weeks, discarding the GTR+I+G model and running only GTR+G on 6-7 cores.

I wrote a wrapper script (in Ruby - that's what I know best) that automatically stops the PartitionFinder analysis each time a new step of the greedy algorithm is reached. It then copies the working directory to a number of replicate directories, modifies the cfg files in each replicate so that they all work on different schemes (making use of the user-defined schemes section), and starts the replicate runs. After the replicate runs have finished, the 'subsets' and 'phyml' folders of all replicate directories are merged with the respective folders of the original working directory, and PartitionFinder is again started in the original directory, until either the next greedy algorithm step is reached or the analysis finishes.

I attached the script in case anybody else would like to play around with it. It really hasn't been tested thoroughly, and the output may be a bit confusing (all replicates writing simultaneously), but it worked fine for me on OS X 10.8, with the current version of PartitionFinder and Ruby 1.9.3. If anybody needs help with it, let me know; some information on how to run it is in the script header. Memory consumption was up to 1 GB per replicate, so depending on memory and dataset size, people may want to run fewer replicates.
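For anyone who wants just the consolidation step without the full Ruby wrapper, the folder merge Michael describes can be sketched in a few lines of Python (the 'subsets' and 'phyml' folder names follow the layout mentioned in this thread; the paths are placeholders to adjust for your own runs):

```python
import shutil
from pathlib import Path

def merge_replicates(master, replicates):
    """Copy finished result files from each replicate run's 'subsets' and
    'phyml' folders into the master run's folders, skipping files that the
    master run already has."""
    for folder in ("subsets", "phyml"):
        dest = Path(master) / folder
        dest.mkdir(parents=True, exist_ok=True)
        for rep in replicates:
            src = Path(rep) / folder
            if not src.is_dir():
                continue  # this replicate never produced that folder
            for f in src.iterdir():
                target = dest / f.name
                if f.is_file() and not target.exists():
                    shutil.copy2(f, target)

# Example (hypothetical paths):
# merge_replicates("run/analysis", ["rep1/analysis", "rep2/analysis"])
```

After merging, restarting PartitionFinder in the master directory should reuse every subset analysed by any replicate.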

Cheers,

Michael