Making use of a graphics card with BEAGLE


bander...@gmail.com

Jun 26, 2017, 8:05:20 PM
to beast-users
I've been hoping to use my new laptop to tackle some old analyses of a large dataset. The dataset is of chloroplast alignments, with 14 partitions and c. 100,000 bp.
(I'm using BEAST 2.4.6)

The output of >beast -beagle_info gives:

BEAGLE resources available:
0 : CPU
    Flags: PRECISION_SINGLE PRECISION_DOUBLE COMPUTATION_SYNCH EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO SCALING_ALWAYS SCALERS_RAW SCALERS_LOG VECTOR_SSE VECTOR_NONE THREADING_NONE PROCESSOR_CPU FRAMEWORK_CPU


1 : GeForce GTX 1070
    Global memory (MB): 8112
    Clock speed (Ghz): 1.65
    Number of cores: 3072
    Flags: PRECISION_SINGLE PRECISION_DOUBLE COMPUTATION_SYNCH EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO SCALING_ALWAYS SCALERS_RAW SCALERS_LOG VECTOR_NONE THREADING_NONE PROCESSOR_GPU FRAMEWORK_CUDA


2 : GeForce GTX 1070 (OpenCL 1.2 CUDA)
    Global memory (MB): 8111
    Clock speed (Ghz): 1.65
    Number of multiprocessors: 16
    Flags: PRECISION_SINGLE PRECISION_DOUBLE COMPUTATION_SYNCH EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO SCALING_ALWAYS SCALERS_RAW SCALERS_LOG VECTOR_NONE THREADING_NONE PROCESSOR_GPU FRAMEWORK_OPENCL

I have only the one graphics card, but there are two entries because both CUDA and OpenCL are in place (a pain to get working). I'm using Ubuntu 16.04.
I am hoping to get the most efficiency out of this configuration, and having read the blog about load balancing (https://www.beast2.org/tag/beagle/), I wanted to try out different options.
Here is what I have found with my data set:

>beast input.xml uses the CPU, 14 instances (100% CPU): gets about 9 mins/M samples

>beast -threads -1 input.xml uses CPU with more instances (c. 100) (about 460% CPU): gets about 12.5 mins/M samples

>beast -beagle_GPU input.xml uses the GPU, 14 instances (CUDA only): gets about 7 mins/M samples

>beast -beagle_GPU -threads -1 input.xml gives a CUDA out of memory error from <GPUinterfaceCUDA.cpp>

>beast -beagle_order 0,1 input.xml uses only CPU (as first run): gets about 9.5 mins/M samples

>beast -beagle_GPU -beagle_order 0,1 input.xml uses CPU and GPUs (7 instances each): gets about 12 mins/M samples

>beast -beagle_GPU -beagle_order 0,1,2 input.xml starts using the OpenCL resource as well: gets about 13.5 mins/M samples

>beast -beagle_GPU -beagle_order 1,2 input.xml uses only the two GPU types: gets about 9.5 mins/M samples

>beast -beagle_GPU -beagle_order 0,0,1,1,1,1,1,1,2,2,2,2,2,2 input.xml puts two threads on CPU, six on GPU type 1, six on GPU type 2: gets about 13 mins/M samples

>beast -beagle_GPU -beagle_order 2 input.xml uses the OpenCL for 14 instances: gets about 13.5 mins/M samples

>beast -beagle_GPU -beagle_order 1 -threads 28 input.xml gives CUDA out of memory error at 22 instances

>beast -beagle_GPU -beagle_order 1 -threads 21 input.xml gives CUDA out of memory error with 22 instances (why? multiple?)

>beast -beagle_GPU -beagle_order 0,1 -threads 28 input.xml gives CUDA out of memory error with 22 instances each

>beast -beagle_GPU -beagle_order 0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1 input.xml gives 14 CPU instances only: gets about 11 mins/M samples (weird?)


The best option appears to be the CUDA GPU without any tinkering with order. I still can't figure out how to do load scaling.
It seems the GPU can only handle 21 instances, so my 14-partition dataset doesn't fit more than once.

Does anyone have any suggestions about getting more efficiency out of my resources? The partitions are different sizes, so perhaps it would make sense to divide them up more carefully. Does anyone know how to implement the scaling that is mentioned on the load balancing blog?

Any recommendations would be appreciated.

Cheers,

Ben Anderson

Andrew Rambaut

Jun 27, 2017, 4:54:57 AM
to beast...@googlegroups.com
Hi Ben,

As you say, resources 1 and 2 are the same graphics card accessed through different APIs. CUDA is probably the more efficient of the two, so stick with just that (plus the CPU). Basically, the more site patterns you have in a GPU instance of BEAGLE, the better. So I suggest dividing the partitions between CPU and GPU (CUDA) such that the largest go to the GPU. Perhaps put twice as many of the smaller partitions on the CPU as there are cores. The balance of partitions between CPU and GPU is the thing you will need to test (i.e., similar to your last run but weighted more toward the GPU, I suspect).
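The assignment suggested above can be automated. Here is a minimal sketch that, given hypothetical partition lengths (the values below are made up; substitute your own, in the same order as the partitions appear in the XML), sends the six largest partitions to resource 1 (the CUDA GPU) and the rest to resource 0 (the CPU), and prints the resulting -beagle_order string:

```shell
#!/bin/sh
# Hypothetical partition lengths in bp, listed in XML partition order
# (indices 0..13). Replace with your own values.
lengths="9000 2500 14000 1200 8000 3000 12000 1500 7000 6500 11000 1800 2000 2200"

# Rank partitions by length, assign the 6 largest to resource 1 (GPU)
# and the rest to resource 0 (CPU), then emit the order string back in
# XML partition order.
order=$(echo $lengths | tr ' ' '\n' | nl -v0 | sort -k2,2nr | \
        awk 'NR<=6 {print $1" 1"; next} {print $1" 0"}' | \
        sort -k1,1n | awk '{printf "%s%s", sep, $2; sep=","}')
echo "$order"
# Prints: 1,0,1,0,1,0,1,0,1,0,1,0,0,0
```

You would then run: beast -beagle_GPU -beagle_order "$order" input.xml. The cut-off of 6 is just a starting point to experiment with.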

All the partitions on the GPU will be processed sequentially, so you may not see huge speed-ups; if all the sites were in one partition, you probably would. I suggest trying a single large partition on the GPU to get a best-case number for benchmarking.
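Since the right CPU/GPU balance has to be found empirically, a small loop over candidate order strings can keep the testing organized. A sketch, assuming you have trimmed chainLength in the XML so each trial run finishes quickly (the candidate strings below are examples for 14 partitions and two resources, 0 = CPU, 1 = CUDA GPU):

```shell
#!/bin/sh
# Sketch: try a shortened chain under several candidate -beagle_order
# settings and time each one. Substitute your own candidate strings.
for order in "1" "0,1" "1,1,1,1,1,1,1,1,0,0,0,0,0,0"; do
    echo "=== -beagle_order $order ==="
    # Uncomment to actually run (set a short chainLength in the XML first):
    # time beast -beagle_GPU -beagle_order "$order" -overwrite input.xml
done
```

Comparing the reported samples/min across trials then tells you which split to use for the full run.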

Andrew





bander...@gmail.com

Jun 27, 2017, 8:52:02 PM
to beast-users
Thanks Andrew!

I went ahead and tried what you recommended. To benchmark, I put all my data into a single partition, and ran it on the GPU only:

>beast -beagle_GPU input.xml
This gave about 2 min 20s/M samples (!!!)

I also tried dividing the original partitions more carefully between the CPU and the GPU, putting the eight smallest on the CPU and the six largest on the GPU:
>beast -beagle_GPU -beagle_order 1,1,1,1,0,0,1,1,0,0,0,0,0,0 input.xml
This gave about 6 min/M samples, so a bit of an improvement over my previous best of 7 min/M samples.

I also tried the four smallest on the CPU and the ten largest on the GPU:
>beast -beagle_GPU -beagle_order 1,1,1,1,1,0,1,1,1,0,1,1,0,0 input.xml
This gave about 6 min 20 sec/M samples, so pretty good, but slightly slower than the former.

Thanks for the help!
Cheers,
Ben