Hi all,
I made a version of BUCKy which allows parallel execution using OpenMPI. You can find it here:
https://github.com/tkchafin/mpi-bucky
I am currently "alpha testing" it, and there are some known issues: --create-sample-file does not currently work, and there are some known memory leaks. I am working on tracking down the memory leaks, and will fix the --create-sample-file option when I have time. If you find any issues, please let me know so I can add them to my to-do list. My email address is:
tkch...@uark.edu
As for the speedup, there is some effect of the communication cost between threads. Each "exchange" between chains in the MCMCMC requires several MPI_Send/MPI_Recv calls, and at the end of sampling each thread communicates its local data back to the "master" process. So we don't get a fully linear speedup from adding more threads. In some runs on the example dataset, using 2 runs with 4 chains each and 1 million iterations, I saw a 2X speed-up using 2 threads, a 3.5X speed-up using 4 threads, and a 6.2X speed-up using 8 threads. So the parallel implementation does speed things up quite a bit, even factoring in the communication cost.
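To put those numbers in terms of how far they fall from linear scaling, here is a small sketch that computes parallel efficiency (speed-up divided by process count). The three data points are the ones quoted above; the function name is just for illustration and nothing here comes from the BUCKy code itself.

```python
# Speed-ups quoted in the post: process count -> observed speed-up.
reported = {2: 2.0, 4: 3.5, 8: 6.2}


def parallel_efficiency(speedup, nprocs):
    """Fraction of ideal linear speed-up achieved (1.0 means perfectly linear)."""
    return speedup / nprocs


for nprocs, speedup in sorted(reported.items()):
    eff = parallel_efficiency(speedup, nprocs)
    print(f"{nprocs} processes: {speedup:.1f}X speed-up, efficiency {eff:.1%}")
```

That works out to 100% efficiency at 2 processes, 87.5% at 4, and 77.5% at 8, which is the communication cost showing up as the process count grows.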
Anyways, I hope someone finds this helpful. I made it because running BUCKy on very large datasets was taking too long and exceeding walltime limits on my university's cluster, so I was hoping to speed things up a bit and thought others might find it useful as well.
You will need to have OpenMPI installed (and the mpic++ compiler in your path), and additionally you will need to wrap your bucky command-line call with the proper mpirun call. For example, to run MPI-enabled bucky across 4 threads, with 2 runs and 4 chains each (assuming you are in the directory containing the bucky executable):
mpirun -np 4 ./bucky -k 2 -c 4 -s1 1234 -s2 2345 -n 1000000 <*.in files>
Note that "-np 4" specifies the number of processes to use. Do not exceed the number of runnable threads on your machine(s), and keep in mind that memory use will scale with the number of processes, as each daughter process keeps its own local copy of most of the data structures used in BUCKy. I haven't characterized the memory usage/growth of the program yet because I typically run it on high-memory nodes, so this hasn't become an issue for me.
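If you launch runs from a script, a check like the following can guard against asking for more processes than the machine has. This is only a sketch: build_bucky_command and its parameters are hypothetical names, the .in file names are placeholders, and os.cpu_count() only sees the logical CPUs of the local machine, so on a multi-node cluster job you would pass the limit from your scheduler instead.

```python
import os


def build_bucky_command(nprocs, runs, chains, seed1, seed2, iterations,
                        input_files, max_procs=None):
    """Assemble the mpirun call from the post as an argument list.

    Hypothetical helper: raises ValueError if nprocs is below 1 or
    above max_procs (default: logical CPUs on this machine).
    """
    if max_procs is None:
        max_procs = os.cpu_count() or 1  # cpu_count() may return None
    if nprocs < 1:
        raise ValueError("need at least one process")
    if nprocs > max_procs:
        raise ValueError(f"-np {nprocs} exceeds the {max_procs} available CPUs")
    return [
        "mpirun", "-np", str(nprocs), "./bucky",
        "-k", str(runs), "-c", str(chains),
        "-s1", str(seed1), "-s2", str(seed2),
        "-n", str(iterations),
        *input_files,
    ]


# Placeholder input files; max_procs is set explicitly so the example
# does not depend on the machine it runs on.
cmd = build_bucky_command(4, 2, 4, 1234, 2345, 1000000,
                          ["gene1.in", "gene2.in"], max_procs=8)
print(" ".join(cmd))
```

The list form can be handed straight to subprocess.run(cmd) without going through a shell.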
Also note that you will not get exactly the same result from this version and the sequential version, even when providing the same seeds. In my version, the "master" process advances the random number generator (seeded with the input seed) in order to sample new seeds for the local RNGs of each daughter process.
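The general idea can be illustrated with a short sketch. This is not the actual BUCKy code: Python's random module stands in for BUCKy's RNG, and derive_worker_seeds and the 32-bit seed range are assumptions made for the example.

```python
import random


def derive_worker_seeds(input_seed, nworkers):
    """Seed a master RNG with the user's seed, then draw one seed per worker.

    Hypothetical illustration of the scheme described in the post.
    """
    master = random.Random(input_seed)
    return [master.randrange(2**32) for _ in range(nworkers)]


# The "master" turns the single input seed into four worker seeds.
worker_seeds = derive_worker_seeds(1234, 4)
workers = [random.Random(seed) for seed in worker_seeds]

# A sequential run would instead draw directly from an RNG seeded with 1234.
sequential = random.Random(1234)

print("sequential first draw:", sequential.random())
for rank, rng in enumerate(workers):
    print(f"worker {rank} first draw:", rng.random())
```

The derivation is deterministic, so the same input seed and process count give the same worker seeds every time. The draws each worker sees just come from a different stream than the one a sequential run would use.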
Please let me know if you find any issues and I will try to fix them as quickly as possible. You can either email me, or post an "issue" on the GitHub page.
Tyler