I've been hoping to use my new laptop to tackle some old analyses of a large dataset of chloroplast alignments: 14 partitions, c. 100,000 bp.
(I'm using BEAST 2.4.6)
The output of >beast -beagle_info gives:
BEAGLE resources available:
0 : CPU
Flags: PRECISION_SINGLE PRECISION_DOUBLE COMPUTATION_SYNCH EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO SCALING_ALWAYS SCALERS_RAW SCALERS_LOG VECTOR_SSE VECTOR_NONE THREADING_NONE PROCESSOR_CPU FRAMEWORK_CPU
1 : GeForce GTX 1070
Global memory (MB): 8112
Clock speed (Ghz): 1.65
Number of cores: 3072
Flags: PRECISION_SINGLE PRECISION_DOUBLE COMPUTATION_SYNCH EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO SCALING_ALWAYS SCALERS_RAW SCALERS_LOG VECTOR_NONE THREADING_NONE PROCESSOR_GPU FRAMEWORK_CUDA
2 : GeForce GTX 1070 (OpenCL 1.2 CUDA)
Global memory (MB): 8111
Clock speed (Ghz): 1.65
Number of multiprocessors: 16
Flags: PRECISION_SINGLE PRECISION_DOUBLE COMPUTATION_SYNCH EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO SCALING_ALWAYS SCALERS_RAW SCALERS_LOG VECTOR_NONE THREADING_NONE PROCESSOR_GPU FRAMEWORK_OPENCL
I have only one graphics card; it appears twice because both CUDA and OpenCL are installed (a pain to get working). I'm on Ubuntu 16.04.
I'm hoping to get the most out of this configuration, and having read the blog post about load balancing (https://www.beast2.org/tag/beagle/), I wanted to try different options.
Here is what I found with my dataset:
>beast input.xml uses the CPU, 14 instances (100% CPU): about 9 min/M samples
>beast -threads -1 input.xml uses the CPU with more instances (c. 100; about 460% CPU): about 12.5 min/M samples
>beast -beagle_GPU input.xml uses the GPU, 14 instances (CUDA only): about 7 min/M samples
>beast -beagle_GPU -threads -1 input.xml gives a CUDA out-of-memory error from <GPUinterfaceCUDA.cpp>
>beast -beagle_order 0,1 input.xml uses only the CPU (as in the first run): about 9.5 min/M samples
>beast -beagle_GPU -beagle_order 0,1 input.xml uses the CPU and the GPU (7 instances each): about 12 min/M samples
>beast -beagle_GPU -beagle_order 0,1,2 input.xml starts using the OpenCL resource as well: about 13.5 min/M samples
>beast -beagle_GPU -beagle_order 1,2 input.xml uses only the two GPU resources: about 9.5 min/M samples
>beast -beagle_GPU -beagle_order 0,0,1,1,1,1,1,1,2,2,2,2,2,2 input.xml puts two partitions on the CPU and six on each GPU resource: about 13 min/M samples
>beast -beagle_GPU -beagle_order 2 input.xml uses OpenCL for all 14 instances: about 13.5 min/M samples
>beast -beagle_GPU -beagle_order 1 -threads 28 input.xml gives a CUDA out-of-memory error at 22 instances
>beast -beagle_GPU -beagle_order 1 -threads 21 input.xml also gives a CUDA out-of-memory error at 22 instances (why 22, with only 21 threads? multiple instances per thread?)
>beast -beagle_GPU -beagle_order 0,1 -threads 28 input.xml gives a CUDA out-of-memory error at 22 instances each
>beast -beagle_GPU -beagle_order 0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1 input.xml gives 14 CPU instances only: about 11 min/M samples (odd; perhaps only the first 14 entries, all CPU, are used?)
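Rather than typing each split out by hand, the CPU/GPU sweep above could be generated. A minimal sketch in Python, assuming (as the 0,0,1,... run above suggests) that the i-th -beagle_order entry is assigned to the i-th partition; actually timing each command is left to the shell:

```python
# Sketch: generate one candidate `beast` command per CPU/GPU split of the
# 14 partitions, so each split can be timed in turn. Flags and resource
# numbers (0 = CPU, 1 = CUDA GPU) are taken from the runs above.

N_PARTITIONS = 14  # from the dataset described above


def order_string(n_cpu: int, n_parts: int = N_PARTITIONS) -> str:
    """One -beagle_order entry per partition: n_cpu on 0, the rest on 1."""
    return ",".join(["0"] * n_cpu + ["1"] * (n_parts - n_cpu))


def candidate_commands(xml: str = "input.xml"):
    """Yield one command for each split, from all-GPU to all-CPU."""
    for n_cpu in range(N_PARTITIONS + 1):
        yield f"beast -beagle_GPU -beagle_order {order_string(n_cpu)} {xml}"


if __name__ == "__main__":
    for cmd in candidate_commands():
        print(cmd)
```

Each printed line could then be run for a short chain and the min/M samples figure compared.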
The best option so far is the CUDA GPU with no tinkering with -beagle_order. I still can't figure out how to do load scaling.
It seems the GPU can only handle 21 instances, so my 14-partition dataset doesn't fit more than once.
Does anyone have suggestions for getting more efficiency out of these resources? The partitions are different sizes, so perhaps it would make sense to divide them up more carefully. Does anyone know how to implement the load scaling mentioned on the blog?
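On the partition-size point: if -beagle_order really does map entries to partitions in order, a size-aware assignment could be computed with a simple greedy heuristic (largest partition first, onto the least-loaded resource). A sketch, where the bp counts and the speed ratio are invented placeholders rather than measurements:

```python
# Sketch: build a size-aware -beagle_order string by assigning the largest
# partitions first to whichever resource has the smallest weighted load.
# PARTITION_BP and SPEEDS below are hypothetical -- substitute real values.

PARTITION_BP = [12000, 11000, 9500, 9000, 8500, 8000, 7500,
                7000, 6500, 6000, 5500, 5000, 4500, 4000]  # made up
SPEEDS = {0: 1.0, 1: 2.0}  # guessed relative throughput: GPU ~2x CPU


def balanced_order(sizes, speeds):
    """Per-partition resource ids, balancing bp/speed across resources."""
    load = {r: 0.0 for r in speeds}   # weighted load per resource
    assign = [None] * len(sizes)
    # classic greedy / longest-processing-time heuristic
    for i in sorted(range(len(sizes)), key=lambda i: -sizes[i]):
        r = min(load, key=lambda c: load[c] + sizes[i] / speeds[c])
        assign[i] = r
        load[r] += sizes[i] / speeds[r]
    return assign


if __name__ == "__main__":
    order = balanced_order(PARTITION_BP, SPEEDS)
    print("-beagle_order", ",".join(map(str, order)))
```

The resulting string keeps partition order (entry i is partition i), so it can be pasted straight after -beagle_order.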
Any recommendations would be appreciated.
Cheers,
Ben Anderson