Tips and tricks: How I boosted BEAST performance 2-fold


Qiyun Zhu

Jan 22, 2014, 1:58:52 AM
to beast...@googlegroups.com

Dear all,

I have been working with a large multi-partition dataset for a while, trying to reduce its computation time from one week to a reasonable level, without changing the XML file or the computer hardware. I achieved some quite significant speed improvements and would like to share what I tried with the community.

Summary:

Things that help: BEAGLE, SSE, multiple BEAGLE instances (for single-partition data), scaling (none > dynamic > always), a decent GPU device, two or more GPU devices, a deliberately designed BEAGLE resource order, one GPU holding multiple partitions, a newer version of BEAST (1.8.0), XMP (an Intel memory technology), Linux (compared to Windows).
Things that have little effect: using BEAGLE_GPU without careful consideration, multiple BEAGLE instances (for multi-partition data), the way one launches BEAST, the maximum memory allocated to Java, other moderate tasks running on the computer.
Things I am not sure about: CPU overclocking, multiple threads, power-saving features.

Now I will show my procedure and benchmark results step by step.

Hardware specifications:
CPU: Intel Core i7-4820K (Ivy Bridge-E, 4 core, 8 threads, 3.7-3.9 GHz, 10 MB cache)
Memory: 2x8GB dual channel DDR3, 1866 MHz, CL9
1st graphics card: GeForce GTX 650 (384 cores, 1058 MHz, 1 GB GDDR5 @ 80 GB/s)
2nd graphics card: GeForce GT 640 (384 cores, 900 MHz, 2 GB DDR3 @ 28.5 GB/s)
*Comment: one powerful CPU plus two entry-level GPUs constitute the computation resources. I tried to set up quad-channel memory, but some DIMM slots on the motherboard are bad, so I left it at dual channel. The whole machine cost less than 1000 USD.

Software environment:
Ubuntu 13.10 desktop amd64, with Linux kernel version 3.11.0-15.
Oracle Java SE runtime version 1.7.0 update 51.
NVIDIA driver version 319.60
BEAGLE version 2.1
BEAST version 1.8.0

Dataset:
The big dataset contains 13 DNA partitions plus one binary partition, with a total of 8786 sites and 5301 unique patterns.
Partition sizes are uneven. The largest partition has 969 patterns, followed by 788, 766, ... with the smallest being 82.
50 million generations of MCMC are necessary to obtain sufficient samples after convergence.
In each benchmark, the first 100,000 generations are executed with the same seed, logging every 1,000 generations.

Before starting:
1. In my previous experience, the same analysis runs notably faster on Linux than on Windows on the same machine with the same versions of BEAST and BEAGLE. I didn't benchmark it this time, for I don't have Windows on this machine, and I am not sure if it is a general rule.
2. I turned on XMP in BIOS. In my previous benchmark on another machine, this switch reduced running time from 16m54s to 14m57s.
3. I left the default, resource-consuming Ubuntu Unity shell on. In my previous benchmark, switching to the lightweight LXDE did not accelerate the analysis.
4. I tried to overclock the CPU to 4.3 GHz; however, BEAST still ran at the base frequency of 3.7 GHz on this dataset, as if Turbo Boost weren't working. The CPU load was around 550% (out of 800%). Meanwhile, with other single-partition datasets I tested, BEAST always ran at the maximum frequency. I wonder if it's a bug, or if BEAST just isn't pushing the CPU to full load on multi-partition datasets. Anyway, overclocking isn't typically an option for serious researchers, so I didn't proceed with it.
5. I tried executing beast directly versus the indirect way, java -Xmx####m -jar lib/beast.jar, which allows allocating more memory to BEAST. They both ran at the same pace. More memory didn't seem to help speed, but it sometimes helps to overcome overflow problems when the dataset is large.
6. I did not switch off any power-saving options in the operating system or in BIOS, because I noticed that they make little difference. But maybe they help on other machines.
7. I was able to run Firefox, MS Word (yes, under Wine), Adobe Reader and many other things alongside the BEAST analyses without even changing the meter reading. This also indicates that BEAST does not drain all CPU power.
8. I kept the BEAST 1.8.0 defaults: dynamic scaling (-beagle_scaling dynamic) and double precision (-beagle_double). In my previous experience, they performed well compared to no scaling and single precision.

Baseline:
Simply ran BEAST on this dataset, without any additional parameters. It took 12.22 min.
The same result was obtained with BEAST version 1.7.5.
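For a sense of scale, the benchmark time can be extrapolated linearly to the full 50-million-generation run. This is a rough sketch under the assumption that wall time scales linearly with chain length (the function name is mine, for illustration only):

```python
# Extrapolate the full-run cost from the 100,000-generation benchmark,
# assuming wall time scales linearly with chain length.
BENCHMARK_GENERATIONS = 100_000
FULL_RUN_GENERATIONS = 50_000_000

def full_run_hours(benchmark_minutes: float) -> float:
    """Scale the benchmark wall time up to the full chain length."""
    scale = FULL_RUN_GENERATIONS / BENCHMARK_GENERATIONS
    return benchmark_minutes * scale / 60

# 12.22 min per 100k generations comes to about 101.8 hours (~4.2 days) for 50M.
print(round(full_run_hours(12.22), 1))
```

So shaving a few minutes off the benchmark translates into days saved on the real analysis.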

BEAGLE (-beagle):
Turned on BEAGLE, and the time was reduced to 11.40 min.
The performance increase was notable in this test, and was even more dramatic in my other analyses. I think this option should always be on.

SSE (-beagle_sse):
Turning on the SSE flag for the CPU reduced the time to 11.18 min.
This effect is even larger on other machines I tested.
I left this on for the subsequent tests.

Multiple threads (-threads):
2 threads: 12.55 min, 3 threads: 11.17 min, 4 threads: 10.99 min (best performance), 5 threads: 11.20 min, 6 threads: 11.19 min, 7 threads: 11.27 min, 8 threads: 11.18 min.
I'm not sure how this works; my impression is that BEAST uses all cores/threads by default. I didn't pursue this option further.

Multiple BEAGLE instances (-beagle_instances):
2 instances: 11.54 min, 4 instances: 12.74 min, 6 instances: 14.02 min, 8 instances: 15.44 min
As I understand it, this option creates the specified number of BEAGLE instances for each partition. So the number of partitions times the number of instances should not exceed the number of cores/threads of the machine. In this test, since there are more partitions (14) than cores/threads (8), I'd better not touch this option.
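The rule of thumb above can be written down as a quick sanity check (the function is mine, just to illustrate the arithmetic; it is not a BEAST API):

```python
def instances_fit(n_partitions: int, n_instances: int, n_threads: int) -> bool:
    """Each partition gets n_instances BEAGLE instances, so the total
    should not exceed the machine's hardware threads."""
    return n_partitions * n_instances <= n_threads

# 14 partitions on an 8-thread CPU: even 2 instances per partition oversubscribes.
print(instances_fit(14, 2, 8))   # False
# A single-partition dataset can afford one instance per hardware thread.
print(instances_fit(1, 8, 8))    # True
```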

GPU:
Turning on the GPU (-beagle_gpu) along with SSE cost 11.60 min, even slightly worse than not using it.
By default the first GPU (GTX 650) was set to handle the last partition in the dataset. I'm not sure how BEAST designates this partition. The other GPU (GT 640) was not used.
In my other analyses, especially on a computer with a poor CPU and a decent GPU, this option could mean a huge acceleration (more than 4-fold in my case).
I guess whether the GPU speeds up or slows down the calculation depends on the specific machine and the partitioning scheme of the dataset. Obviously BEAST's default setting is not optimal, so I decided to play with it a bit more.

CUDA:
By default BEAST uses OpenCL instead of CUDA on the GPU. Switching to CUDA (-beagle_cuda) reduced the time to 11.10 min.
The reason to choose OpenCL is that ATI cards also support OpenCL (I didn't try those), while CUDA is only supported by NVIDIA cards.

BEAST 1.7.5 with GPU:

The analysis just could not proceed; the screen said "underflow". I switched to single precision and it ran, with lots of "underflow" warnings. But when I fed Java more memory, the analysis went smoothly and took 11.60 min.
It seems that BEAST 1.8.0 does a better job than 1.7.5 in GPU support.

Smart assignment of BEAGLE resources (-beagle_order):
This is the biggest gain I have come across so far! Highly recommended!

Basic usage:
Execute beast -beagle_info, and the program lists all available BEAGLE resources. Typically the CPU is #0 and the GPU is #1.
Suppose there are 5 partitions and one wants the GPU to work on the 3rd partition. Then we should type:
beast -beagle_order 0,0,1,0,0 input.xml

Note: -beagle_order by default uses CUDA instead of OpenCL, so just leave it as is.
Note: SSE is not compatible with -beagle_order. Just switch off SSE, or it will override any GPU usage.

I used this option to assign the GTX 650 GPU to the largest partition instead of the last partition. The calculation time then became 9.68 min.
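The mapping is easy to script. Here is a small sketch that builds the -beagle_order string from a partition-to-resource mapping (the helper name and the dict layout are mine, for illustration only):

```python
def beagle_order(n_partitions: int, gpu_assignments: dict) -> str:
    """Build a -beagle_order string: resource 0 (the CPU) everywhere,
    except for partitions explicitly mapped to a GPU resource number.
    gpu_assignments maps 0-based partition index -> BEAGLE resource id."""
    return ",".join(str(gpu_assignments.get(i, 0)) for i in range(n_partitions))

# 5 partitions with the GPU (resource 1) on the 3rd, as in the example above:
print(beagle_order(5, {2: 1}))   # 0,0,1,0,0
```

With two GPUs, something like beagle_order(14, {0: 1, 1: 2}) would place resources 1 and 2 on the first two partitions.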

(Pseudo) alternative GPU resource:
In the BEAGLE resources displayed by -beagle_info, each graphics card was listed twice, with different descriptions. BEAST uses the first resource by default. When I set the second resource on the largest partition, the calculation slowed down to 10.00 min.


Multiple GPUs:
Given that I have two graphics cards, I let the GTX 650 handle the largest partition and the GT 640 handle the 2nd largest partition. The time further decreased to 8.52 min.

One GPU on multiple partitions:

I used to understand that one GPU could only handle one partition; however, with this version of BEAST and BEAGLE, I found that assigning more than one partition to one GPU gave a significant boost.
With the 1st and 2nd largest partitions assigned to the GTX 650, the time was 8.38 min.
With the 1st, 2nd and 3rd largest partitions on the GTX 650, the time was 8.06 min.
Going further and putting the top 4 partitions on the GTX 650, performance decreased: 9.94 min.

It appears that a GPU has a limited capacity, but we can squeeze out as much as possible before the limit is reached.

Finally, I combined multiple GPUs with multiple partitions. I tested several combinations, and the one with each GPU taking care of two particular partitions (GTX 650 on the 1st and 2nd largest, GT 640 on the 3rd and 4th largest) gave the best performance. The final number of this series is 6.75 min.

I didn't test every possible combination, for there are too many. It is very likely that there exists a better combination that further reduces the calculation time.
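Tabulating the series so far (times copied from the steps above) shows where the roughly 2-fold overall speedup comes from:

```python
# Benchmark wall times (minutes) copied from the steps above.
results = [
    ("baseline", 12.22),
    ("BEAGLE", 11.40),
    ("BEAGLE + SSE", 11.18),
    ("GTX 650 on the largest partition", 9.68),
    ("two GPUs, one partition each", 8.52),
    ("GTX 650 on the top 3 partitions", 8.06),
    ("two GPUs, two partitions each", 6.75),
]

baseline = results[0][1]
for config, minutes in results:
    print(f"{config:35s} {minutes:6.2f} min  {baseline / minutes:4.2f}x")
# The last line shows the final ~1.81x speedup over the baseline.
```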

Conclusions:
A. without GPU:
For single-partition dataset, do: beast -beagle_sse -beagle_instances <number of core/threads in your CPU> input.xml
For multi-partition dataset, do: beast -beagle_sse input.xml
B. with GPU:
For single-partition dataset, do: beast -beagle_sse -beagle_gpu -beagle_cuda input.xml
For multi-partition dataset, do: beast -beagle_order <a smart scheme based on your tests if you have time> input.xml

That's it. I hope my experience offers some relief to other BEAST users who are struggling with the long time they have to spend on Bayesian inference. There may be incorrect, arbitrary or case-specific statements in this article, so please read with caution, and hopefully let me know if I am wrong. Thanks!

Qiyun Zhu


alex leung

May 5, 2014, 6:18:29 AM
to beast...@googlegroups.com
Zhu's post is very useful as a reference. I wonder if anybody can share their thoughts on selecting the following components for a computer mainly (90% of the time) used for BEAST analysis on mainly 3 types of dataset:

I)   14000 DNA sites (or 8 partitions of 1000-2000 sites each)
II)  10000 DNA sites
III) 70 partitions of 1000-9000 sites each

1. CPU
a) Intel Xeon 8 Core 16 Thread E5-2690 2.9GHz (20M Cache,LGA 2011,32nm, 8GT/s Intel QPI) US$ 2100
b) Intel Xeon 4 Core 8 Thread E3-1275V3 Haswell 3.5GHz with 4600 Graphics (8M Cache,LGA 1150,22nm) BOX US$ 350
c) Intel Core i7-4960X Extreme Edition (3.6GHz,15M Cache,LGA 2011) CPU BOX w/o Fan US$ 1050
d) Intel Core i7-4820K (3.7GHz,10M Cache,LGA 2011) CPU BOX w/o Fan US$ 350

2. RAM
a) Kingston DDR3 1600MHz 8GB ECC+Register RAM KVR16R11D8/8G US $125
b) Kingston DDR3 1600MHz 8GB RAM KVR16N11/8G     US $75

3. Display card
a) GTX650-E-1GD5 GTX650 1GB DDR5(no 6PIN for power supply) US $100
b) GTX760-DC2OC-2GD5 GTX760 2GB GDDR5 PCI-E Display Card US $275


Perhaps power supply & cooling also have to be taken care of (the computer will be placed in a normal office environment, ~23 degrees centigrade)? The budget is about US $3000, though I can spend more if really necessary.

Mel Melendrez

Sep 16, 2014, 3:05:26 PM
to beast...@googlegroups.com
Great post, thank you so much. This helped greatly in giving me ideas for how to run my dataset and set up my own benchmarks to figure out the fastest route through a Bayesian analysis.

Cheers,
Mel

David Lee

Jul 24, 2015, 8:06:40 PM
to beast-users, qiyu...@gmail.com
Dear Zhu,

Thanks for your great test! It is very helpful to us.
I am a beginner at BEAST, and I want to ask you two questions:
1. How did you determine how many MCMC generations were needed to obtain sufficient samples after convergence?
2. I read some books, and they also mentioned partitions; however, I cannot figure out why partitions are so important in either ML or BEAST.
    I used RAxML (GUI version) to run some data, and the default output is bipartitions. I do not know why, and how can I tell how many partitions are in my dataset?

Thanks for all,

David

Qiyun Zhu wrote on Wednesday, January 22, 2014 at 2:58:52 PM UTC+8:

sara

Feb 14, 2016, 9:31:14 AM
to beast-users
Dear Qiyun Zhu,

thanks a lot for your post. I am trying to figure out how to make my analyses faster, and I am a bit confused about what some BEAGLE options mean; maybe you can give me a little help here:

my dataset is 30 genes (partitions) with overall > 4000 site patterns (variable across loci).
I have access to two nodes with 1 GPU + 1 CPU (12 cores) each, plus one machine with 12 CPU cores + 3 GPUs.

I have been running datasets with:
beast -beagle_order 1 input.xml

(but it still takes a huge amount of time, and the ESSs on the likelihood are low, so I will have to do multiple runs and combine them... so it's sounding a bit impracticable. I will have to make several runs on GPUs and CPUs to get final results)

first, some quick doubts:
1) I thought that the -beagle_sse flag was only for the CPU, but I realize from your post that it is not. Can you explain a bit why you use it for multi-partition data on the GPU and single-partition data on the CPU? Is that because with -beagle_order you will assign some partitions to the CPU (and use SSE)?

2) I am still a bit confused about beagle_instances and threads. Should they both refer to the number of threads/cores of the machine being used?
I found this in a recent post by Remco Bouckaert:

On Friday, June 6, 2014 1:37:18 PM UTC+12, Remco Bouckaert wrote:
The beagle_instances flag appears to be ignored in BEAST 2 (not quite sure why this happens). 
Using both beagle-instance and threads flags  ensures BEAST uses multiple cores using BEAGLE, like so:
beast -beagle_instances 2 -threads 2 
When the beagle-instance flag is set, but the threads flag is not, only a single thread is used and only one BEAGLE instance is created.
Remco
here: https://groups.google.com/forum/#!searchin/beast-users/use$20beagle/beast-users/goQ51Yosn1E/l0itIzY8X8IJ


my current idea (as I have 30 partitions, more than the cores on any node) is that I should not play with instances, and should use -threads only if I want to limit the number of processors it uses on a machine. Is that correct?

3) Then I see from your usage of -beagle_order that you specify all partitions (by their order in the input) and decide which ones to send to the GPU and CPU. I was always simply sending everything to the GPU, thinking it would be faster.

Am I totally wrong??

I will run some tests based on what you posted here on my datasets, but any input is very, very welcome. Thanks in advance,
sara

Qiyun Zhu

Feb 16, 2016, 1:35:02 PM
to beast...@googlegroups.com
Dear Sara,

I am glad that my post provided you some inspiration. I wrote it a long time ago, with my preliminary understanding of BEAST and BEAGLE as a black box.

SSE is a CPU instruction set, and GPUs don't have it. Ideally, BEAST should turn on SSE for the CPU and leave the GPU as is. However, the version I tested (1.7.5) had some compatibility issues when turning on both -beagle_sse and -beagle_order. I hope that issue has been fixed in later versions. Anyway, specifying them both will not hurt.

By the way, the descendant of SSE is AVX, then AVX2. You said that your computers have 12 cores, so I guess that they are of the Xeon X5600 generation, which does not support AVX. But if my guess is wrong and your machines are more up to date (Xeon E5 or above), you can try the AVX option instead of SSE, if available.
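On Linux, whether the CPU advertises AVX can be checked before trying the flag. A quick sketch (Linux-specific; it returns an empty set where /proc/cpuinfo is unavailable):

```python
def cpu_flags() -> set:
    """Parse the 'flags' line of /proc/cpuinfo (Linux only)."""
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()

flags = cpu_flags()
print("SSE2:", "sse2" in flags, " AVX:", "avx" in flags, " AVX2:", "avx2" in flags)
```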

As I understand it, it is much more efficient to compute with a GPU than with a CPU if the task is suitable (meaning parallelizable). The reason for not sending everything to the GPU, I guess, is that there is an upper limit to the computing capacity per GPU (and CPU). I am not sure what bottlenecks the capacity, but I guess it is the I/O bandwidth. In my tests, RAM channels and frequency matter a lot for BEAST performance. And GPUs, with thousands of tiny cores and a moderate amount of RAM, are heavily affected by the speed of data exchange between the cores, the RAM, and the CPU.

I guess the ultimate solution is to invest in a computer with higher I/O capability. Your computer, to the best of my guess, could be a dual Xeon X5690 (up to 3-channel DDR3 1333 MHz, QPI = 6.4 GT/s, 12 MB cache) + 3x Tesla M2090 (6 GB RAM at 384-bit, PCIe 2.0 interface). Since then, the raw power of CPUs hasn't evolved much, but I/O technology has improved dramatically. Currently a comparable workstation model would be, for example, the E5-2600 v3 series: 4-channel DDR4 2133 MHz, QPI = 9.6 GT/s, up to 45 MB cache.

As for GPUs, the Tesla K series may be a decent bet, since they have the high double-precision GFLOPS required by BEAST. The later Tesla M series features mixed precision, and I am not sure if BEAST likes that. Honestly, I guess that a cheap used AMD gaming card with high double-precision GFLOPS and high bandwidth (e.g., Radeon HD 7970 GHz Edition) may work as well as an expensive Tesla card in BEAST.

Best.
Qiyun





