GPU-enabled PyMC for big data sets


Wes McKinney

Feb 20, 2011, 3:37:35 PM
to py...@googlegroups.com
Hi all,

I just wanted to send a quick note on a topic that may be of some
interest to PyMC users.

There has been quite a bit of interest lately in using GPUs to speed
up MCMC in various ways-- the most obvious target is the computation
of log-likelihoods in MH steps, particularly when the number of
observed data points is very large. See example reference:

http://pubs.amstat.org/doi/abs/10.1198/jcgs.2010.10016

With large data sets, it should be no surprise that doing this can
dramatically reduce the runtime. We recently started working on a
general C library of CUDA functions (still in very early prototype
stages, unstable API, etc.) which can easily be called from Python (I
am using Cython to do it):

https://github.com/dukestats/gpustats

The goal at the moment is to implement a variety of density functions
on the GPU. An immediate use (outside of hand-coded MCMC) would be to
wire these into PyMC somehow (I haven't thought too much about this)
to let folks with GPUs put them to work in their PyMC code. We don't
have a set timeline, but this will likely be an ongoing project.

As a toy example I took John McFarland's code from a few days ago,
ratcheted up the size of the observed data, and monkey-patched
pymc.MvNormal to use the GPU multivariate normal pdf function
implemented in the above repo:

import numpy as np
import numpy.random as rand
from numpy.linalg import inv
import pymc as pm

import pdfs  # GPU density functions from the gpustats repo

k = 5
rand.seed(1)
true_std = rand.randn(k)
true_mean = rand.randn(k)
true_cov = np.eye(k) * np.outer(true_std, true_std)
n = 10000

np.random.seed(1)
vals = np.random.multivariate_normal(true_mean, true_cov, n)

# Trying to give tau a vague prior and reasonable starting value...
tau = pm.Wishart('tau', n=k + 1, Tau=np.eye(k),
                 value=inv(np.eye(k) * 10.0))
m = pm.Normal('m', mu=0, tau=1e-5, value=np.zeros(k), size=k)

data = pm.MvNormal('data', m, tau, value=vals, observed=True)

def logp(value=None, mu=None, tau=None):
    # Call the GPU function -- which needs a covariance matrix
    return pdfs.mvnpdf(value, [mu], [inv(tau)]).sum()

data._logp.fun = logp

S = pm.MCMC(locals())
S.sample(iter=1000, burn=0, thin=1)

The difference in runtime speaks for itself:

Without the GPU function:

15:09 ~/code/gpustats $ time python pymc_test.py
2011-02-20 15:21:25-0500 [-] Log opened.
Sampling: 100% [0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000]
Iterations: 1000

real 0m31.912s
user 0m32.120s
sys 0m0.120s

With:

15:21 ~/code/gpustats $ time python pymc_test.py
2011-02-20 15:22:27-0500 [-] Log opened.
Sampling: 100% [0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000]
Iterations: 1000

real 0m3.798s
user 0m3.260s
sys 0m1.000s

So roughly a 10x speedup. As the data size shrinks, the fixed overhead
of the CUDA calls starts to dominate, and eventually you only break
even: with 1,000 observations instead of 10,000, the speedup drops to
only about 1.5x on my machine.
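The break-even behavior can be sketched with a toy cost model: each GPU call pays a roughly fixed kernel-launch and memory-transfer overhead, plus a much smaller per-observation cost than the CPU. The constants below are purely illustrative, not measurements:

```python
# Toy cost model for the CPU-vs-GPU break-even point.
# All constants are illustrative, not measured values.

def cpu_time(n, per_point=3.6e-6):
    # CPU likelihood cost grows linearly with the number of observations
    return per_point * n

def gpu_time(n, overhead=1.4e-3, per_point=2e-9):
    # GPU pays a fixed launch/transfer overhead per call, then a tiny
    # per-observation cost
    return overhead + per_point * n

for n in (1_000, 10_000, 1_000_000):
    print(f"n={n:>9,}  speedup ~{cpu_time(n) / gpu_time(n):.1f}x")
```

The qualitative behavior matches the timings in this thread: for small n the fixed overhead swamps the per-point savings, and the speedup grows with the data size.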

The fairest comparison is to time the logp functions directly (here
logp is as defined above):

In [19]: timeit pm.mv_normal_chol_like(vals, true_mean,
np.linalg.cholesky(true_cov))
10 loops, best of 3: 36.4 ms per loop

In [20]: timeit logp(vals, true_mean, true_cov)
1000 loops, best of 3: 1.43 ms per loop

In [21]: 36.4 / 1.43
Out[21]: 25.454545454545453

Or with 1e6 data points:

In [23]: timeit pm.mv_normal_chol_like(vals, true_mean,
np.linalg.cholesky(true_cov))
1 loops, best of 3: 366 ms per loop

In [24]: timeit logp(vals, true_mean, true_cov)
100 loops, best of 3: 3.47 ms per loop

In [25]: 366 / 3.47
Out[25]: 105.47550432276657

These speeds of course depend on what kind of GPU you have.

Anyway-- if anyone is interested in getting involved with this effort,
please feel free to contact me on or off the PyMC list.

Cheers,
Wes

Thomas Wiecki

Feb 20, 2011, 4:14:28 PM
to py...@googlegroups.com
Hi Wes,

I think that approach is highly promising, thanks for sharing!

I tried something similar for my hand-coded likelihood function. I
used PyCUDA (http://mathema.tician.de/software/pycuda), which made
integrating CUDA code into Python extremely easy -- have you looked
into it?

In my case I hit a memory-transfer bottleneck: each call would
expensively copy the data from RAM to device memory. But I think it
depends on how you code the likelihood (e.g. caching would be an
option).

-Thomas

> --
> You received this message because you are subscribed to the Google Groups "PyMC" group.
> To post to this group, send email to py...@googlegroups.com.
> To unsubscribe from this group, send email to pymc+uns...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/pymc?hl=en.
>
>

Wes McKinney

Feb 20, 2011, 4:24:49 PM
to py...@googlegroups.com
On Sun, Feb 20, 2011 at 4:14 PM, Thomas Wiecki
<thomas...@googlemail.com> wrote:
> Hi Wes,
>
> I think that approach is highly promising, thanks for sharing!
>
> I tried something similar for my hand-coded likelihood function. I
> have used pycuda (http://mathema.tician.de/software/pycuda) which made
> integration of cuda code into python extremely easy, have you looked
> into that?
>
> In my case I hit a memory transfer bottleneck where each time it would
> costly copy over the data from RAM to device memory, but I think it
> depends on how you code the likelihood (e.g. caching would be an
> option).
>
> -Thomas
>
<snip>

PyCUDA is definitely on my radar -- there is some concern about
creating a Python-only library when many Bayesians are still using
Matlab, R, and their own C code. One really nice thing about PyCUDA is
that it lets you manage the lifetime of data in GPU global memory from
Python-land, which may matter down the road. Right now the code is
"stateless": each function call is a full roundtrip of input data and
results, which, as you mentioned, can be a problem.
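A sketch of what a stateful version might look like: the observed data is "uploaded" once and kept resident, so each MCMC iteration only ships the small parameter arrays. The class name and the plain-NumPy MVN log-density are illustrative, not part of gpustats; with PyCUDA, the cached attribute would hold a pycuda.gpuarray on the device instead of a NumPy array:

```python
import numpy as np

class CachedMVNDensity:
    """Keep the observed data resident across likelihood calls.

    The device buffer is simulated with a NumPy array so this sketch
    runs without a GPU; with PyCUDA it would be a gpuarray uploaded
    once in __init__ and reused on every logp call.
    """

    def __init__(self, data):
        # "Upload" once; never transferred again.
        self._device_data = np.asarray(data, dtype=np.float64)

    def logp(self, mean, cov):
        # Only mean/cov (small) travel per iteration; the data stays put.
        x = self._device_data - mean
        solve = np.linalg.solve(cov, x.T).T
        n, k = x.shape
        _, logdet = np.linalg.slogdet(cov)
        quad = (x * solve).sum()
        return -0.5 * (quad + n * (k * np.log(2 * np.pi) + logdet))
```

The design point is that the expensive transfer happens once at construction, so the per-iteration cost is dominated by the kernel itself rather than the RAM-to-device copy Thomas mentioned.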

- Wes

Anand Patil

Feb 21, 2011, 6:06:47 AM
to py...@googlegroups.com, Wes McKinney
Hi Wes,


Great stuff. The most civilized ways to wire your example above into PyMC would be to use the decorator syntax,

@pm.stochastic(observed=True)
def data(mu=m, tau=tau, value=vals):
    # Call the GPU function -- which needs a covariance matrix
    return pdfs.mvnpdf(value, [mu], [inv(tau)]).sum()

or pm.stochastic_from_dist:

GPUMVN = pm.stochastic_from_dist("GPUMVN", gpu_mvn_logp, gpu_mvn_random, mv=True)
data = GPUMVN("data", m, tau, observed=True)

What are your plans for https://github.com/dukestats/gpustats/blob/master/pdfs.py#L19 -- is that going on the GPU at some point too?


Another advantage of PyCUDA is that it can compile kernels at runtime, so if you write your boilerplate code as a template, people could conceivably write just the business part of a new likelihood, for example "x * exp(-x)", and get a GPU likelihood from it without writing any compiled extensions.
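The templating idea can be sketched without a GPU at all: keep the kernel boilerplate in a string template and substitute in the user's expression. With PyCUDA, the rendered source would then be handed to pycuda.compiler.SourceModule for runtime compilation. The template and function names below are made up for illustration, not taken from gpustats or PyCUDA:

```python
# Boilerplate CUDA kernel with a hole for the user's "business part".
# (Illustrative template only.)
KERNEL_TEMPLATE = """
__global__ void logp_kernel(const float *x, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float xi = x[i];
        out[i] = %(expression)s;
    }
}
"""

def render_kernel(expression):
    # `expression` is a CUDA expression in terms of the variable `xi`,
    # e.g. "xi * expf(-xi)". With PyCUDA one would then do:
    #   mod = pycuda.compiler.SourceModule(render_kernel(expression))
    return KERNEL_TEMPLATE % {"expression": expression}

source = render_kernel("xi * expf(-xi)")
```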


Keep us posted and good luck,
Anand

Flavio Coelho

Feb 21, 2011, 6:45:50 AM
to py...@googlegroups.com
+1 for using PyCUDA over straight C wrapped by Cython. Though you should also keep an eye on PyOpenCL (from the same developer as PyCUDA), which can run on both NVIDIA and ATI hardware.


Flávio Codeço Coelho
================





Wes McKinney

Feb 24, 2011, 5:53:06 PM
to py...@googlegroups.com
Thank you guys for the input. I had the opportunity to create a little framework for defining density functions (and eventually other things) using PyCUDA, and I'm pretty happy with it. I'll make it into a library on PyPI soon, and I encourage those interested to keep an eye on the GitHub project!

Though now that I see Apple has put AMD cards in their new line of laptops, perhaps I should port the code to OpenCL at some point. I'll leave that for another time.

Best,
Wes

Flavio Coelho

Feb 25, 2011, 5:58:36 AM
to py...@googlegroups.com, Wes McKinney
That's great, Wes,

I'll definitely test it (and give some feedback) as soon as I can get my PyCUDA installation working.

best,


Flávio Codeço Coelho
================




Anand Patil

Feb 25, 2011, 6:22:16 AM
to py...@googlegroups.com, Flavio Coelho, Wes McKinney
On Fri, Feb 25, 2011 at 10:58 AM, Flavio Coelho <fcco...@gmail.com> wrote:
That's great, Wes,

I'll definitely test it (and give some feedback) as soon as I can get my PyCUDA installation working.

best,

Flávio Codeço Coelho
================


I've had some difficulty getting all of PyCUDA's dependencies installed on local machines as well. An easy alternative is Justin Riley's StarCluster AMI, a machine image for use on 'cluster GPU' instances on Amazon's Elastic Compute Cloud; they cost $2.10 per hour. See https://github.com/jtriley/StarCluster/issues?authenticity_token=c434011df33abe1c6b75496b53ae9fb15e630f38#issue/9.

Anand


Flavio Coelho

Feb 25, 2011, 9:22:02 AM
to Anand Patil, py...@googlegroups.com, Wes McKinney
Cool!

thanks for the tip, Anand!



Flávio Codeço Coelho
================




Chris Fonnesbeck

Feb 25, 2011, 5:50:39 PM
to PyMC
On Feb 25, 5:22 am, Anand Patil <anand.prabhakar.pa...@gmail.com>
wrote:
> I've had some difficulty getting all of pyCuda's dependencies installed on
> local machines also.

In contrast, I was easily able to install PyOpenCL on my Macbook Pro
via pip.

cf

Wes McKinney

Feb 25, 2011, 9:45:56 PM
to py...@googlegroups.com

I think using PyOpenCL is an inevitability, especially given that
Apple is putting AMD GPUs in the next line of MacBook Pros. I've heard
that developing OpenCL code is a bit more difficult, though. CUDA can
be enough of a challenge -- I had to tinker quite a bit to get all of
the global-memory transactions to coalesce and to avoid shared-memory
bank conflicts (for those who know what that CUDA-speak means =))

I got PyCUDA 0.94 to install out of the box on my MBP -- but you have
to a) use 32-bit Python and b) have the CUDA driver installed. Part a)
doesn't make me very happy (my target at the moment is desktops
running Linux, where this is less of an issue).

- Wes

Wes McKinney

Feb 27, 2011, 1:31:29 PM
to py...@googlegroups.com

One potential problem with OpenCL appears to be the lack of a CURAND
equivalent. There is an MD5-based PRNG in PyOpenCL, but I'm not sure
how it compares in speed or in the quality of the random numbers. For
the purposes of PyMC the main benefit will be computing densities, and
that can all be done in OpenCL.

- Wes

Patric Holmvall

Apr 3, 2011, 7:59:23 AM
to PyMC
Hey there Wes,
Sorry for digging up a month-old topic. I just thought I'd tell you
that hopefully a new RNG/PRNG is going to be implemented in PyOpenCL.
I'm currently working on Monte Carlo/Metropolis calculations on the
GPU using PyOpenCL, which means I can't really take advantage of PyMC,
so combining PyMC and PyOpenCL/PyCUDA like you proposed would be
really neat. Unfortunately, as you say, the current PRNG is not that
good: it is slow and has a poor quality factor. I've been talking a
bit to the maintainer of PyOpenCL (Andreas Klöckner), who seems
interested in updating it, and I made a request on the PyOpenCL
mailing list. You can follow the discussion here:
http://thread.gmane.org/gmane.comp.python.opencl/579

Yours sincerely,
Patric Holmvall



Thomas Wiecki

Apr 3, 2011, 4:12:43 PM
to py...@googlegroups.com, Patric Holmvall
There is recent work on this topic from Theano as well
(https://bitbucket.org/jaberg/theano-curand). This might be too far in
the future (not sure how far along the pymc-theano branch is, Anand?),
but it could provide an alternate fix for this problem.

-Thomas

Kyle Foreman

Nov 4, 2011, 4:54:57 PM
to py...@googlegroups.com
Sorry to drag up an old topic, but I've recently begun playing around with Theano and was wondering if anyone has made any progress on using it with PyMC lately. It looks like Theano's RNG capabilities have increased quite a bit over the past few months.

John Salvatier

Nov 4, 2011, 4:58:59 PM
to py...@googlegroups.com
I am still actively interested in this. I have built a toy package (discussed here), but it has a serious, subtle flaw. I am currently learning more probability theory to figure out how to solve it correctly.

On Fri, Nov 4, 2011 at 1:54 PM, Kyle Foreman <kylef...@gmail.com> wrote:
Sorry to drag up an old topic, but I've recently begun playing around with Theano and was wondering if anyone has made any progress on using it with PyMC lately. It looks like Theano's RNG capabilities have increased quite a bit over the past few months.


Thomas Wiecki

Nov 4, 2011, 5:12:11 PM
to py...@googlegroups.com
Agreed, this should also make it trivial to sample using the GPU.

The other nice thing is that Theano can do automatic differentiation
(e.g. of one's own likelihood), which might enable automatic use of
gradient-based samplers such as Hamiltonian Monte Carlo.
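To make the connection concrete, here is the core of an HMC step in plain Python: a leapfrog integrator for a standard-normal target. The gradient of the log-density is hand-coded, and that hand-coding is exactly the piece automatic differentiation would supply for an arbitrary likelihood (a toy sketch, not PyMC or Theano code):

```python
# Gradient of log N(q | 0, 1), up to a constant -- the piece that
# autodiff would derive automatically for a user-supplied likelihood.
def grad_logp(q):
    return -q

def leapfrog(q, p, step, n_steps):
    """One leapfrog trajectory: alternating momentum/position updates."""
    p = p + 0.5 * step * grad_logp(q)    # initial half step for momentum
    for _ in range(n_steps - 1):
        q = q + step * p                 # full step for position
        p = p + step * grad_logp(q)      # full step for momentum
    q = q + step * p                     # last full position step
    p = p + 0.5 * step * grad_logp(q)    # final half step for momentum
    return q, p
```

Leapfrog is the standard choice here because it is reversible and nearly conserves the Hamiltonian, which keeps the Metropolis acceptance rate of the HMC proposal high.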

Thomas Wiecki

Nov 5, 2011, 10:17:56 PM
to py...@googlegroups.com
Indeed, here is an HMC sampler in Theano that makes use of automatic
differentiation:
http://www.deeplearning.net/tutorial/hmc.html#hmc

Kyle Foreman

Jan 27, 2012, 9:07:26 AM
to py...@googlegroups.com
Here's another attempt at using Theano as the basis for an MCMC platform: https://github.com/jaberg/MonteTheano

It's got Metropolis-Hastings working, and HMC seems to be under development. It also has quite a few probability distributions written up, many of which I hadn't seen Theano implementations of before.